parallel.splchk_addr: Spell check a large data frame of NYC addresses.

View source: R/parallel.splchk_addr.R

Description

The parallel.splchk_addr function uses parallel processing to spell check a large data frame of NYC addresses (more than 10,000 records) more efficiently. Addresses are checked against a street name dictionary built from the NYC Department of City Planning's (DCP) PAD (Property Address Directory) and SND (Street Name Dictionary).
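The general split, process, and recombine pattern behind a parallel spell check like this can be sketched with base R's parallel package. This is a minimal illustration under stated assumptions, not the package's actual implementation: fix_chunk is a hypothetical stand-in for the per-chunk spell check, and the single correction it applies exists only to keep the example self-contained.

```r
library(parallel)

# Hypothetical stand-in for a per-chunk spell check: it only fixes one
# known misspelling so the example stays self-contained.
fix_chunk <- function(chunk) {
  chunk$ADDR.splchk <- sub("BRODWAY", "BROADWAY", chunk$ADDR, fixed = TRUE)
  chunk
}

df <- data.frame(u_id = 1:8, ADDR = paste(1:8, "BRODWAY"),
                 stringsAsFactors = FALSE)

n_workers <- 2
# Assign each row to one of n_workers roughly equal chunks.
chunk_id <- rep(seq_len(n_workers), length.out = nrow(df))
chunks <- split(df, chunk_id)

# Start the workers, process the chunks, and always shut the cluster down.
cl <- makeCluster(n_workers)
res <- tryCatch(
  parLapply(cl, chunks, fix_chunk),
  finally = stopCluster(cl)
)

# Recombine and restore the original row order.
out <- do.call(rbind, res)
out <- out[order(out$u_id), ]
rownames(out) <- NULL
```

Splitting by row keeps each worker's task independent, which is why the results can simply be row-bound and re-sorted by the identifier column afterwards.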

Usage

parallel.splchk_addr(in_clus, in_df, new_addr_col_name, addr_col_name,
  third_col_name, third_col_type)

Arguments

in_clus

the number of clusters available to the function as integer. Required.

in_df

a data frame containing NYC addresses. Required.

new_addr_col_name

the name of output addresses column as string. Required.

addr_col_name

the name of the input addresses column as string. Required.

third_col_name

the name of either the borough code or zip code column as string. Required.

third_col_type

either "boro_code" or "zip_code" as string. Required.
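One way to choose a value for in_clus is to base it on the machine's core count, and it can help to validate the column arguments before starting a long run. The sketch below uses parallel::detectCores(); the helper check_splchk_args is hypothetical and not part of rNYCclean.

```r
library(parallel)

# Leave one core free for the rest of the system; detectCores() can
# return NA on some platforms, so fall back to a single worker.
n_cores <- detectCores()
in_clus <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

# Hypothetical pre-flight check (not part of rNYCclean).
check_splchk_args <- function(in_df, addr_col_name, third_col_name,
                              third_col_type) {
  stopifnot(is.data.frame(in_df))
  missing_cols <- setdiff(c(addr_col_name, third_col_name), names(in_df))
  if (length(missing_cols) > 0) {
    stop("Columns not found in in_df: ", paste(missing_cols, collapse = ", "))
  }
  # Errors unless third_col_type is one of the two documented values.
  match.arg(third_col_type, c("boro_code", "zip_code"))
}

df <- data.frame(ADDR = "1 BRODWAY", BORO_CODE = 1)
check_splchk_args(df, "ADDR", "BORO_CODE", "boro_code")
```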

Value

A data frame containing the input data frame plus the spell checked address column.
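Because the result keeps the original address column alongside the new one, the two can be compared to see which records were altered. The sketch below uses a small mock data frame in place of real output; its values are illustrative assumptions, not results produced by the function.

```r
# Mock of the returned data frame (illustrative values only, not real output).
df1 <- data.frame(
  u_id = 1:3,
  ADDR = c("1 BRODWAY", "2 BROADWAY", "3 FULTAIN ST"),
  ADDR.splchk = c("1 BROADWAY", "2 BROADWAY", "3 FULTON ST"),
  stringsAsFactors = FALSE
)

# Flag rows where the spell checked address differs from the input.
changed <- df1$ADDR != df1$ADDR.splchk

# Count the altered records and list them side by side.
sum(changed)
df1[changed, c("u_id", "ADDR", "ADDR.splchk")]
```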

Examples

# load the package
library(rNYCclean)

# create a data frame of misspelled addresses
ADDR <- c(paste(1:5000, "BRODWAY"), paste(1:2400, "1 AVNUE"),
    paste(1:3400, "ATLANTAC AVE"), paste(1:3400, "FULTAIN ST"))
# borough code 3 for the Atlantic/Fulton addresses, 1 for the rest
BORO_CODE <- ifelse(grepl("ATLANT|FULT", ADDR), 3, 1)
u_id <- seq_along(ADDR)
df <- data.frame(u_id, ADDR, BORO_CODE)

# get version of DCP PAD used to build package data
rNYCclean::pad_version

# get number of records
nrow(df)

# spell check address column using borough code
df1 <- parallel.splchk_addr(in_clus = 10, in_df = df,
    new_addr_col_name = "ADDR.splchk", addr_col_name = "ADDR",
    third_col_name = "BORO_CODE", third_col_type = "boro_code")

# preview records
head(df1)

gmculp/rNYCclean documentation built on July 14, 2020, 5:07 a.m.