parallel.seqsplt_addr: Split address strings in a large data frame into sequential...


View source: R/parallel.seqsplt_addr.R

Description

The parallel.seqsplt_addr function uses parallel processing to split address strings in a large data frame (more than 10,000 records) into sequential combinations of words more efficiently.
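
To make "sequential combinations of words" concrete, here is a minimal base R sketch. It assumes the phrase means every contiguous run of words in the address. The helper seq_word_combos is hypothetical and is not part of rNYCclean, so the package's actual output may differ in order or content.

```r
# Hypothetical helper (not part of rNYCclean): list every contiguous run
# of words in a single address string. This is an assumed reading of
# "sequential combinations of words".
seq_word_combos <- function(addr) {
  # split on runs of whitespace after trimming the ends
  words <- strsplit(trimws(addr), "\\s+")[[1]]
  n <- length(words)
  out <- character(0)
  for (start in seq_len(n)) {
    for (end in start:n) {
      # join words start..end back into one string
      out <- c(out, paste(words[start:end], collapse = " "))
    }
  }
  out
}

seq_word_combos("253 BROADWAY FLR")
```

Under that assumption, a three-word address yields six strings, from single words up to the full address.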

Usage

parallel.seqsplt_addr(in_clus, in_df, new_addr_col_name, id_col_name,
  addr_col_name, third_col_name, remove_orig = TRUE)

Arguments

in_clus

the number of clusters available to the function, as an integer. Required.

in_df

a data frame containing addresses. Required.

new_addr_col_name

the name of the output addresses column, as a string. Required.

id_col_name

the name of the unique identifier column, as a string. Required.

addr_col_name

the name of the input addresses column, as a string. Required.

third_col_name

the name of either the borough code or zip code column, as a string. Required.

remove_orig

whether to exclude the original address from the output, as a logical (TRUE or FALSE). Optional; defaults to TRUE.

Value

A data frame containing id_col_name, third_col_name, and a column of address strings split into sequential combinations of words.
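
As a rough picture of that return shape, the sketch below builds a long-format data frame with the base parallel package. It is an illustration under stated assumptions, not the package's implementation: the worker split_one is hypothetical, "sequential combinations" is taken to mean contiguous runs of words, and the cluster size of 1 stands in for in_clus.

```r
library(parallel)

# Hypothetical worker (not part of rNYCclean): all contiguous word runs
# of one address string.
split_one <- function(addr) {
  words <- strsplit(trimws(addr), "\\s+")[[1]]
  n <- length(words)
  unlist(lapply(seq_len(n), function(s) {
    vapply(s:n, function(e) paste(words[s:e], collapse = " "), character(1))
  }))
}

# small input with the same column layout as the documented example
df <- data.frame(u_id = 1:2,
                 ADDR = c("253 BROADWAY", "125 WORTH STREET"),
                 BORO_CODE = c(1, 1),
                 stringsAsFactors = FALSE)

# distribute the addresses over a cluster (size 1 here, standing in for in_clus)
cl <- makeCluster(1)
pieces <- parLapply(cl, df$ADDR, split_one)
stopCluster(cl)

# one output row per split string, carrying the id and third column along
out <- data.frame(u_id = rep(df$u_id, lengths(pieces)),
                  BORO_CODE = rep(df$BORO_CODE, lengths(pieces)),
                  ADDR.seqsplt = unlist(pieces),
                  stringsAsFactors = FALSE)
head(out)
```

Each input record repeats once per split string, so the result has more rows than the input.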

Examples

library(rNYCclean)

# create a data frame of addresses
ADDR <- c("ROOM 326 125 WORTH STREET", "253 BROADWAY FLR 3",
    "C/O DOHMH 42-09 28 STREET")
BORO_CODE <- c(1, 1, 4)
u_id <- seq_along(ADDR)
df <- data.frame(u_id, ADDR, BORO_CODE)

# split address column into sequential combinations
df1 <- parallel.seqsplt_addr(in_clus = 1, in_df = df,
    new_addr_col_name = "ADDR.seqsplt", id_col_name = "u_id",
    addr_col_name = "ADDR", third_col_name = "BORO_CODE")

# preview records
head(df1)

gmculp/rNYCclean documentation built on July 14, 2020, 5:07 a.m.