parallel.regex_addr: String replacement cleaning of a large data frame of NYC...
In gmculp/rNYCclean: A package to clean NYC addresses


View source: R/parallel.regex_addr.R

The parallel.regex_addr function uses parallel processing to perform string replacement cleaning more efficiently on a large data frame of NYC addresses (more than 10,000 records). Addresses are cleaned against a look-up dataset of locations constructed from the NYC Department of City Planning's (DCP) PAD (Property Address Directory) and SND (Street Name Dictionary). The function also attempts to reconcile addresses that contain post office box information or indicators of a missing address (e.g., "UNKNOWN", "HOMELESS").
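The general pattern described above (split the addresses into chunks, clean each chunk on a separate worker, then reassemble) can be sketched with base R's parallel package. This is only an illustration under stated assumptions, not the rNYCclean implementation: `toy_clean` and `parallel_clean` are hypothetical names, the replacement rules are made up for the demo rather than derived from PAD/SND, and turning "UNKNOWN" or "HOMELESS" into NA is an assumption about how a missing-address indicator might be handled.

```r
library(parallel)

# Hypothetical stand-in for a regex cleaner; NOT the rNYCclean rules.
toy_clean <- function(addr) {
  out <- toupper(trimws(addr))
  out <- gsub("\\s+", " ", out)          # collapse repeated whitespace
  out <- sub(" AVE?$", " AVENUE", out)   # expand a trailing AV or AVE
  out <- sub(" ST$", " STREET", out)     # expand a trailing ST
  # Assumption: treat missing-address indicators as NA
  out[grepl("^(UNKNOWN|HOMELESS)$", out)] <- NA_character_
  out
}

# Hypothetical wrapper: clean one address column in parallel chunks.
parallel_clean <- function(n_workers, df, new_col, addr_col) {
  # split the row indices into n_workers contiguous chunks
  idx <- splitIndices(nrow(df), n_workers)
  chunks <- lapply(idx, function(i) df[[addr_col]][i])
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)   # always shut the workers down
  cleaned <- parLapply(cl, chunks, toy_clean)
  # chunks come back in order, so unlist() restores the original row order
  df[[new_col]] <- unlist(cleaned, use.names = FALSE)
  df
}

demo <- data.frame(
  id = 1:4,
  ADDR1 = c("12  fulton st", "350 5 ave", "UNKNOWN", "1 BROADWAY"),
  stringsAsFactors = FALSE
)
res <- parallel_clean(2, demo, "ADDR.clean", "ADDR1")
print(res)
```

Starting workers has a fixed cost, which is consistent with the function being aimed at large inputs; for a handful of rows a serial call would likely be faster.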

parallel.regex_addr(in_clus, in_df, new_addr_col_name, addr1_col_name,
    addr2_col_name = NULL)

`in_clus`	the number of clusters available to the function, as an integer. Required.
`in_df`	a data frame containing NYC addresses. Required.
`new_addr_col_name`	the name of the output address column, as a string. Required.
`addr1_col_name`	the name of the input address line one column, as a string. Required.
`addr2_col_name`	the name of the input address line two column, as a string. Optional.
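One way to pick a value for `in_clus` is to derive it from the number of cores on the machine. The sketch below assumes that `in_clus` corresponds to the number of worker processes, which this page does not state explicitly; `n_workers` is a hypothetical variable name.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# leave one core free for the rest of the system, but never go below 1
n_workers <- max(1L, as.integer(n_cores) - 1L)
n_workers
```

The result could then be passed as `in_clus = n_workers` in place of a hard-coded value.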

A data frame containing the input data frame plus the cleaned address column.

# load the package so parallel.regex_addr can be called without a namespace prefix
library(rNYCclean)

# create a data frame of addresses (some street names are deliberately truncated)
ADDR1 <- c(paste(1:5000,"BROADWAY"),paste(1:2400,"1"),
    paste(1:3400,"ATLANTIC A"), paste(1:3400,"FULTON S"), paste(1:4000,"NOSTRA"))
ADDR2 <- ifelse(grepl(" 1$",ADDR1),"AVE","ROOM 123")
BORO_CODE <- ifelse(grepl("ATLANT|FULTON|NOSTRA",ADDR1),3,1)
u_id <- seq_along(ADDR1)
df <- data.frame(u_id, ADDR1, ADDR2, BORO_CODE)

#get version of DCP PAD used to build package data
rNYCclean::pad_version

#get number of records
nrow(df)

#one address input column
system.time({df1 <- parallel.regex_addr(in_clus=2, in_df = df, 
    new_addr_col_name = "ADDR.regex", addr1_col_name = "ADDR1")})

#preview records
head(df1)

#two address input columns
system.time({df2 <- parallel.regex_addr(in_clus=2, in_df = df, 
    new_addr_col_name = "ADDR.regex", addr1_col_name = "ADDR1", 
    addr2_col_name = "ADDR2")})

#preview records
head(df2)