parallel.regex_addr: String replacement cleaning of a large data frame of NYC...

View source: R/parallel.regex_addr.R

Description

The parallel.regex_addr function uses parallel processing to perform string replacement cleaning on a large data frame of NYC addresses (more than 10,000 records) against a look-up dataset of locations. The locations dataset was constructed from the NYC Department of City Planning's (DCP) PAD (Property Address Directory) and SND (Street Name Dictionary). In addition, the function attempts to reconcile addresses that contain post office box information or indicators of a missing address (e.g., "UNKNOWN", "HOMELESS").
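The general technique can be sketched with base R's parallel package. This is a minimal illustration, not the package's actual implementation: the `rules` table and the `clean_chunk` helper are hypothetical, and the real function draws its replacements from the PAD/SND look-up dataset rather than a hand-written list.

```r
library(parallel)

# Hypothetical replacement rules, for illustration only.
# A replacement of NA marks a pattern that indicates a missing address.
rules <- data.frame(
  pattern     = c(" ATLANTIC A$", " FULTON S$", "^(UNKNOWN|HOMELESS)$"),
  replacement = c(" ATLANTIC AVENUE", " FULTON STREET", NA),
  stringsAsFactors = FALSE
)

# Apply every rule, in order, to one chunk of addresses.
clean_chunk <- function(addr, rules) {
  for (i in seq_len(nrow(rules))) {
    if (is.na(rules$replacement[i])) {
      # Indicators of a missing address become NA.
      addr[grepl(rules$pattern[i], addr)] <- NA_character_
    } else {
      addr <- sub(rules$pattern[i], rules$replacement[i], addr)
    }
  }
  addr
}

addr <- c("10 ATLANTIC A", "25 FULTON S", "UNKNOWN", "1 BROADWAY")

n_workers <- 2
cl <- makeCluster(n_workers)

# Split the addresses into contiguous chunks, one per worker.
idx <- splitIndices(length(addr), n_workers)
chunks <- lapply(idx, function(i) addr[i])

# Clean each chunk on a worker, then recombine in the original order.
cleaned <- unlist(parLapply(cl, chunks, clean_chunk, rules = rules),
                  use.names = FALSE)
stopCluster(cl)
```

Because the chunks are contiguous and recombined in order, `cleaned` lines up row for row with the input vector, which is what allows a cleaned column to be appended to the original data frame.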

Usage

parallel.regex_addr(in_clus, in_df, new_addr_col_name,
    addr1_col_name, addr2_col_name = NULL)

Arguments

in_clus

the number of clusters (parallel workers) available to the function, as an integer. Required.

in_df

a data frame containing NYC addresses. Required.

new_addr_col_name

the name of the output address column, as a string. Required.

addr1_col_name

the name of the input column holding address line one, as a string. Required.

addr2_col_name

the name of the input column holding address line two, as a string. Optional.
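One common way to pick a value for in_clus is to query the number of available cores and leave one free. This is a sketch of a general convention, assumed here for illustration; it is not a recommendation taken from the package.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free for the rest of the system, but always use at least one.
in_clus <- max(1L, n_cores - 1L)
```

The resulting value can then be passed as the in_clus argument.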

Value

A data frame containing the input data frame plus the cleaned address column.

Examples

# load the package (assumed to be installed)
library(rNYCclean)

# create a data frame of addresses
ADDR1 <- c(paste(1:5000, "BROADWAY"), paste(1:2400, "1"),
    paste(1:3400, "ATLANTIC A"), paste(1:3400, "FULTON S"), paste(1:4000, "NOSTRA"))
ADDR2 <- ifelse(grepl(" 1$", ADDR1), "AVE", "ROOM 123")
BORO_CODE <- ifelse(grepl("ATLANT|FULTON|NOSTRA", ADDR1), 3, 1)
u_id <- seq_along(ADDR1)
df <- data.frame(u_id, ADDR1, ADDR2, BORO_CODE)

# get version of DCP PAD used to build package data
rNYCclean::pad_version

# get number of records
nrow(df)

# one address input column
system.time({df1 <- parallel.regex_addr(in_clus = 2, in_df = df,
    new_addr_col_name = "ADDR.regex", addr1_col_name = "ADDR1")})

# preview records
head(df1)

# two address input columns
system.time({df2 <- parallel.regex_addr(in_clus = 2, in_df = df,
    new_addr_col_name = "ADDR.regex", addr1_col_name = "ADDR1",
    addr2_col_name = "ADDR2")})

# preview records
head(df2)

gmculp/rNYCclean documentation built on July 14, 2020, 5:07 a.m.