pm_parse: Parse Street Addresses


View source: R/parse.R

Description

A wrapper around the parse functions that can be used to shorten all of postmastr's core code down to a single function call once dictionaries have been created and tested against the data.

Usage

pm_parse(.data, input, address, output, new_address, ordinal = TRUE,
    operator = "at", unnest = FALSE, include_commas = FALSE, include_units = TRUE,
    keep_parsed = "no", side = "right", left_vars, keep_ids = FALSE, houseSuf_dict,
    dir_dict, street_dict, suffix_dict, unit_dict, city_dict, state_dict,
    locale = "us")

Arguments

.data

A source data set to be parsed

input

Describes the format of the source address. One of either "full" or "short". A short address contains, at most, a house number, street directionals, a street name, a street suffix, and a unit type and number. A full address contains all of the elements of a short address as well as, at most, a city, state, and postal code.

address

A character variable containing address data to be parsed

output

Describes the format of the output address. One of either "full" or "short". A short address contains, at most, a house number, street directionals, a street name, a street suffix, and a unit type and number. A full address contains all of the elements of a short address as well as, at most, a city, state, and postal code.

new_address

Name of new variable to store rebuilt address in.

ordinal

A logical scalar; if TRUE, street names that contain numeric words (e.g. "Second") will be converted and standardized to ordinal values (e.g. "2nd"). The default is TRUE because it returns much more compact clean addresses (e.g. "168th St" as opposed to "One Hundred Sixty Eighth St").

operator

A character scalar to be used as the intersection operator (between the 'x' and 'y' sides of the intersection).

unnest

A logical scalar; if TRUE, house ranges will be unnested (i.e. a house range that has been expanded to cover four addresses with pm_houseRange_parse will be converted from a single observation to four observations, one for each house number). If FALSE (default), the single observation will remain.
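
To make the effect of unnest concrete, the following base R sketch mimics the reshaping on a toy data frame. It does not call postmastr: the data frame, its column names (id, street, houses, house), and the helper unnest_houses() are all hypothetical, and the code only illustrates the one-row-to-many-rows idea described above rather than the package's own implementation.

```r
# Toy data: one observation whose house range has already been expanded
# to four house numbers, stored as a list-column. All names are hypothetical.
nested <- data.frame(id = 1L, street = "Main St", stringsAsFactors = FALSE)
nested$houses <- list(c("100", "102", "104", "106"))

# Hypothetical helper: repeat each row once per house number it contains
unnest_houses <- function(df) {
  n <- lengths(df$houses)
  out <- df[rep(seq_len(nrow(df)), times = n), c("id", "street"), drop = FALSE]
  out$house <- unlist(df$houses)
  rownames(out) <- NULL
  out
}

unnested <- unnest_houses(nested)

nrow(nested)    # the single observation, analogous to unnest = FALSE
nrow(unnested)  # one observation per house number, analogous to unnest = TRUE
```

Unnesting multiplies the row count, so identifiers such as id repeat across the new rows; this matters for any later join back to the source data.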

include_commas

A logical scalar; if TRUE, a comma is added both before and after the city name in rebuilt addresses. If FALSE (default), no punctuation is added.
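
As a rough illustration of the rule described above, the base R sketch below assembles one address string both ways. It does not use postmastr; the component values and the helper rebuild() are hypothetical, and the exact spacing produced by the package itself is an assumption here.

```r
# Hypothetical parsed components for a single address
street <- "4256 Magnolia Ave"
city   <- "St. Louis"
state  <- "MO"
zip    <- "63110"

# Hypothetical helper: place a comma before and after the city only on request
rebuild <- function(street, city, state, zip, include_commas = FALSE) {
  if (include_commas) {
    paste0(street, ", ", city, ", ", state, " ", zip)
  } else {
    paste(street, city, state, zip)
  }
}

plain  <- rebuild(street, city, state, zip)
commas <- rebuild(street, city, state, zip, include_commas = TRUE)

print(plain)
print(commas)
```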

include_units

A logical scalar; if TRUE (default), the unit name and number (if given) will be included in the output string. Otherwise if FALSE, the unit name and number will not be included.

keep_parsed

Character string; if "yes", all parsed elements will be added to the source data after replacement. If "limited", only the pm.city, pm.state, and postal code variables will be retained. Otherwise, if "no", only the rebuilt address will be added to the source data (default).

side

One of either "left" or "right" - should parsed data be placed to the left or right of the original data? Placing data to the left may be useful in particularly wide data sets.

left_vars

A character scalar or vector of variables to place on the left-hand side of the output when side is equal to "middle".

keep_ids

Logical scalar; if TRUE, the identification numbers will be kept in the source data after replacement. Otherwise, if FALSE, they will be removed (default).

houseSuf_dict

Optional; name of house suffix dictionary object. Standardization and parsing are skipped if none is specified.

dir_dict

Optional; name of directional dictionary object. If none is specified, the full default directional dictionary will be used.

street_dict

Optional; name of street dictionary object. Standardization is skipped if none is specified.

suffix_dict

Optional; name of street suffix dictionary object. If none is specified, the full default street suffix dictionary will be used.

unit_dict

Optional; name of unit dictionary object - NOT CURRENTLY ENABLED

city_dict

Required for "full" addresses; name of city dictionary object.

state_dict

Optional; name of state dictionary object. If none is specified, the full default state dictionary will be used.

locale

A string indicating the country these data represent; the only current option is "us" but this is included to facilitate future expansion.

Details

This function does not currently return countries. If a country identifier is present in the data to be parsed, it will be trimmed off the address and not returned.

Value

An updated version of the source data with, at a minimum, a new variable containing standardized street addresses for each observation. Options allow for columns containing parsed elements to be returned as well.

Examples

# construct dictionaries
dirs <- pm_dictionary(type = "directional", filter = c("N", "S", "E", "W"), locale = "us")
sufs <- pm_dictionary(type = "suffix", locale = "us")
mo <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")
cities <- pm_append(type = "city",
    input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood", "St. Louis",
              "SAINT LOUIS", "Webster Groves"),
    output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))

# add example data
df <- sushi1

# identify
df <- pm_identify(df, var = address)

# temporary code to subset unit
df <- dplyr::filter(df, name != "Drunken Fish - Ballpark Village")

# parse, full output
pm_parse(df, input = "full", address = address, output = "full", keep_parsed = "no",
    dir_dict = dirs, suffix_dict = sufs, city_dict = cities, state_dict = mo)

# parse, short output
pm_parse(df, input = "full", address = address, output = "short", keep_parsed = "no",
    new_address = clean_address, dir_dict = dirs, suffix_dict = sufs,
    city_dict = cities, state_dict = mo)

chris-prener/postmastr documentation built on Dec. 13, 2020, 3:39 a.m.