View source: R/summarize_data.R
Create weighted aggregate tables using NHTS data.
data | Object returned by read_data.
agg | Aggregate function label. Either "household_count", "person_count", "trip_count", "sum", "avg", "median", "household_trip_rate", or "person_trip_rate". See Aggregates section.
agg_var | Character string specifying a numeric variable over which to aggregate. Only relevant when agg is "sum", "avg", or "median".
by | Character vector of one or more variable names to group by. See Analysis Groups section.
subset | Character string containing a pre-aggregation subset condition using data.table syntax. See Filtering section.
label | logical. Use labels for table output?
prop | logical. Use proportions for count aggregates?
prop_by | Character vector of one or more variable names by which to group proportions.
exclude_missing | logical. Exclude missing responses from summary.
data.table object aggregated by input specifications containing the following fields:
by variables - For each by variable, a column of the same name is created. They will appear in the order they are listed, as factors ordered by their codebook values.
W - Weighted statistic.
E - Standard error of the weighted statistic.
S - Surveyed/sampled statistic.
N - Number of observations/sample size.
Aggregates (agg)
What type of aggregate are you interested in?
household_count - Count of households
person_count - Count of persons
trip_count - Count of trips
vehicle_count - Count of vehicles
*Use prop = TRUE in combination with a count aggregate to get the proportion.
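As a rough illustration of what a proportion of a count aggregate means, the base R sketch below computes weighted counts on a made-up table and converts them to shares. The data frame, its column names, and the weights are invented for illustration; this is not a description of how summarize_data is implemented.

```r
# Toy household table: column names and weights are made up for illustration.
households <- data.frame(
  size   = c("1", "2", "2", "3+", "1"),
  weight = c(100, 250, 150, 300, 200)
)

# Weighted count of households per group (sum of weights within each group).
weighted_counts <- tapply(households$weight, households$size, sum)

# Share of the weighted total falling in each group, analogous to prop = TRUE.
weighted_props <- prop.table(weighted_counts)
print(weighted_props)
```

The shares sum to 1 across all groups; grouping proportions within subgroups (as prop_by does) would instead normalize within each subgroup.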
For the following aggregates, you must also specify a numeric aggregate variable using the agg_var parameter:
sum - Sum of agg_var
avg - Arithmetic mean of agg_var
median - Median of agg_var
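To make the idea of a weighted sum and a weighted mean of a numeric variable concrete, here is a minimal base R sketch on invented data. The column names trip_miles and weight are hypothetical, and the package's own weighting and standard error calculations are not shown.

```r
# Toy trip table: trip_miles and weight are hypothetical column names.
trips <- data.frame(
  trip_miles = c(2, 10, 4, 8),
  weight     = c(1, 3, 2, 4)
)

# Weighted sum: each observation contributes its value times its weight.
weighted_sum <- sum(trips$trip_miles * trips$weight)

# Weighted arithmetic mean, using weighted.mean from the stats package.
weighted_avg <- weighted.mean(trips$trip_miles, w = trips$weight)

# Unweighted mean of the sampled values, for comparison.
sampled_avg <- mean(trips$trip_miles)
```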
Trip rates are, simply put, the count of trips divided by the count of persons or households.
household_trip_rate - Daily trips per household.
person_trip_rate - Daily trips per person.
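The ratio described above can be sketched with a few made-up weights. Everything in this example is hypothetical: the weights are invented, and any scaling the survey weights may require is deliberately left out.

```r
# Made-up weights for four persons and the six trips they reported on one day.
person_weights <- c(100, 200, 150, 50)
trip_weights   <- c(100, 100, 200, 150, 150, 150)

# Weighted counts are the sums of the weights.
weighted_trips   <- sum(trip_weights)
weighted_persons <- sum(person_weights)

# Trips per person: weighted trip count divided by weighted person count.
trip_rate <- weighted_trips / weighted_persons
```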
Analysis Groups (by)
By which variables do you wish to aggregate?
Similar to GROUP BY in SQL or a CLASS statement in SAS.
There is no limit to the number of variables specified in the character vector; however, specifying many by variables can result in groups with small sample sizes, which need to be interpreted carefully.
The data.table returned by summarize_data will include a column (of class factor) for each by variable specified.
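For readers less familiar with GROUP BY semantics, the following base R sketch groups a toy table by two variables. The data and column names are invented; it only illustrates the shape of a grouped result (one row per combination of group values, one column per grouping variable), not the package's output.

```r
# Toy person table with two grouping variables and a weight (all hypothetical).
persons <- data.frame(
  worker = factor(c("Yes", "No", "Yes", "No", "Yes")),
  region = factor(c("NE", "NE", "South", "South", "NE")),
  weight = c(10, 20, 30, 40, 50)
)

# Sum the weights within each worker-by-region combination.
grouped <- aggregate(weight ~ worker + region, data = persons, FUN = sum)
print(grouped)
```

Each added grouping variable multiplies the number of possible cells, which is why many by variables can leave very few observations in each group.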
Filtering (subset)
Which households/persons/trips do you wish to include or exclude?
Similar to WHERE in SQL, subset allows you to filter observations/rows in the dataset before summarizing/aggregating.
subset is a string that will be evaluated as a logical vector indicating the rows to keep.
As mentioned above, the string will be evaluated as the i index in a data.table.
In short, similar to the base function subset, there is no need to specify the data object in which the variables are included (i.e., your code would look like "var < 10" instead of "data$var < 10").
Any variable (or combination of variables) found in the codebook can be used in the subset condition. See Logic for a refresher on R's logical operators when using more than one logical condition.
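The idea of a condition string being evaluated into a logical vector of rows to keep can be sketched in base R as follows. This is an illustration on a made-up data frame using eval and parse; it is an assumption-laden stand-in, not the package's internal code, which evaluates the string within a data.table.

```r
# Toy data: the column names var and id are hypothetical.
toy <- data.frame(var = c(5, 12, 8, 20), id = 1:4)

# A condition combining two logical tests, written without "toy$" prefixes.
condition <- "var < 10 & id > 1"

# Parse the string and evaluate it with the data frame's columns in scope,
# which yields one TRUE/FALSE value per row.
keep <- eval(parse(text = condition), envir = toy)

# Keep only the rows for which the condition is TRUE.
filtered <- toy[keep, ]
print(filtered)
```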
You will frequently need to include quotes in your string. You can tackle this a few different ways. The following examples would all evaluate the same way:
"HHSTATE %in% c('GA','FL')"
'HHSTATE %in% c("GA","FL")'
"HHSTATE %in% c(\"GA\",\"FL\")"
# Read 2009 NHTS data with specified csv path:
nhts_data <- read_data('2009', csv_path = 'C:/NHTS')
summarize_data(
  data = nhts_data,             # Using the nhts_data object,
  agg = 'person_trip_rate',     # calculate the person trip rate
  by = 'WORKER',                # by worker status
  subset = 'CENSUS_R == "01"'   # for households in the NE Census region
)