summarize_data: Summarize NHTS Data


View source: R/summarize_data.R

Description

Create weighted aggregate tables using NHTS data.

Usage

summarize_data(data, agg, agg_var = NULL, by = NULL, subset = NULL,
  label = TRUE, prop = FALSE, prop_by = NULL,
  exclude_missing = FALSE)

Arguments

data

Object returned by read_data.

agg

Aggregate function label. Either "household_count", "person_count", "trip_count", "sum", "avg", "median", "household_trip_rate", or "person_trip_rate". See Aggregates section.

agg_var

Character string specifying a numeric variable over which to aggregate. Only relevant when agg is "sum", "avg", or "median".

by

Character vector of one or more variable names to group by. See Analysis Groups section.

subset

Character string containing a pre-aggregation subset condition using data.table syntax. See Filtering section.

label

logical. Use labels for table output?

prop

logical. Use proportions for count aggregates?

prop_by

Character vector of one or more variable names by which to group proportions.

exclude_missing

logical. Exclude missing responses from summary.

Value

data.table object aggregated by input specifications containing the following fields:

Aggregates (agg)

What type of aggregate are you interested in?

Frequencies / Proportions

Use prop = TRUE in combination with a count aggregate to get the proportion.

Numeric Aggregates (Sum / Average / Median)

Must also specify a numeric aggregate variable using the agg_var parameter.

Trip Rates (Daily Person Trips per Person/Household)

Simply put, the count of trips divided by the count of persons or households.
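The division described above can be sketched with a toy example. This is only an illustration in base R: the tables, the person_id column, and all values are made up, and the sketch ignores survey weights, which summarize_data applies when building its weighted tables.

```r
# Toy person and trip tables; all values are made up for illustration.
persons <- data.frame(
  person_id = 1:4,
  WORKER    = c("01", "01", "02", "02"),
  stringsAsFactors = FALSE
)
trips <- data.frame(person_id = c(1, 1, 1, 2, 3, 3, 4))

# Attach each trip to the WORKER group of the person who made it.
trips$WORKER <- persons$WORKER[match(trips$person_id, persons$person_id)]

# Count trips and persons within each WORKER group.
trip_count   <- table(trips$WORKER)
person_count <- table(persons$WORKER)

# Person trip rate = count of trips / count of persons (unweighted here).
person_trip_rate <- trip_count / person_count
print(person_trip_rate)
```

In this toy data, group "01" has 4 trips across 2 persons and group "02" has 3 trips across 2 persons, so the rates are 2 and 1.5.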

Analysis Groups (by)

By which variables do you wish to aggregate?

Similar to GROUP BY in SQL or a CLASS statement in SAS. There is no limit to the number of variables in the character vector; however, specifying many by variables can produce groups with small sample sizes, which need to be interpreted carefully.

The data.table returned by summarize_data will include a column (of class factor) for each by variable specified.
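A grouped call with more than one by variable might look like the sketch below. It reuses the WORKER and CENSUS_R variables and the csv path from the Examples section. The package name summarizeNHTS is assumed from the repository name, and the guard simply skips the call when the package or the data folder is not available.

```r
# Group by two variables at once (both taken from the Examples section).
by_vars <- c("WORKER", "CENSUS_R")

# Assumption: the package is installed under the name 'summarizeNHTS' and
# the 2009 csv files live in C:/NHTS. Otherwise the block is skipped.
if (requireNamespace("summarizeNHTS", quietly = TRUE) && dir.exists("C:/NHTS")) {
  nhts_data <- summarizeNHTS::read_data("2009", csv_path = "C:/NHTS")

  result <- summarizeNHTS::summarize_data(
    data = nhts_data,
    agg  = "person_count",  # count persons
    by   = by_vars          # within each WORKER x CENSUS_R group
  )

  # Inspect the class of each grouping column in the returned table.
  print(sapply(by_vars, function(v) class(result[[v]])))
}
```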

Filtering (subset)

Which households/persons/trips do you wish to include or exclude?

Similar to WHERE in SQL, subset allows you to filter observations/rows in the dataset before summarizing/aggregating.

subset is a string that will be evaluated as a logical vector indicating the rows to keep. As mentioned above, the string is evaluated as the i index of a data.table. As with the base function subset, there is no need to name the data object that contains the variables (i.e., your code would look like "var < 10" instead of "data$var < 10").

Any variable (or combination of variables) found in the codebook can be used in the subset condition. See Logic for a refresher on R's logical operators when using more than one logical condition.
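To illustrate how a condition string with more than one logical test turns into a row filter, here is a minimal base R sketch. It mimics the idea with eval and parse on a toy data frame rather than using data.table itself; the toy values are made up, and only the variable names WORKER and CENSUS_R come from the Examples section.

```r
# Toy data; the codes are illustrative only.
toy <- data.frame(
  CENSUS_R = c("01", "01", "02", "03"),
  WORKER   = c("01", "02", "01", "01"),
  stringsAsFactors = FALSE
)

# Two conditions combined with & (logical AND). Note that the variables are
# referenced by name only, without a 'toy$' prefix.
cond <- 'CENSUS_R %in% c("01", "02") & WORKER == "01"'

# Evaluate the string against the columns of 'toy' to get a logical vector.
keep <- eval(parse(text = cond), envir = toy)
print(keep)  # TRUE FALSE TRUE FALSE

# Keep only the rows where the condition is TRUE.
filtered <- toy[keep, ]
print(filtered)
```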

Quoting within quotes

You will frequently need to include quotes in your string. You can tackle this a few different ways. The following examples would all evaluate the same way:
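As a minimal sketch, three common quoting styles in R are shown here with the CENSUS_R condition from the Examples section. The toy data frame and the eval/parse check are assumptions added for illustration, not part of the package.

```r
cond_a <- 'CENSUS_R == "01"'     # single quotes outside, double quotes inside
cond_b <- "CENSUS_R == '01'"     # double quotes outside, single quotes inside
cond_c <- "CENSUS_R == \"01\""   # double quotes outside, inner quotes escaped

# Toy data used only to check that the three strings select the same rows.
toy <- data.frame(CENSUS_R = c("01", "02", "01"), stringsAsFactors = FALSE)

keep <- lapply(
  list(cond_a, cond_b, cond_c),
  function(cond) eval(parse(text = cond), envir = toy)
)

# cond_a and cond_c are the same string; cond_b differs in its characters
# but evaluates to the same logical vector.
print(identical(cond_a, cond_c))
print(keep)
```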

Examples

# Read 2009 NHTS data with specified csv path:
nhts_data <- read_data('2009', csv_path = 'C:/NHTS')

summarize_data(
  data = nhts_data,           # Using the nhts_data object,
  agg = 'person_trip_rate',   # calculate the person trip rate
  by = 'WORKER',              # by worker status
  subset = 'CENSUS_R == "01"' # for households in the NE Census region
)

Westat-Transportation/summarizeNHTS documentation built on May 17, 2020, 8:57 p.m.