Geocoding based on the City of Houston GIS address and location file
This package uses a lightly cleaned version of the Houston GIS file to try to match addresses and produce lat long locations.
The package assumes that the input GIS file is perfect, so any errors that remain in that file will be propagated. I believe, too, that the lat longs in that file generally represent either building or plat center locations, similar to Google Maps.
As a rule, the various routines will return a tibble with the Latitude, Longitude, a success flag (T/F), and a fail code describing why the match failed, if indeed it did.
Inputs are usually the separated components of the address, the street number, street direction (NSEW), the street name, street type (ST, RD, DR, ...), and the Zip code. The very useful package postmastr by Christopher Prener is recommended for parsing address data.
An example of using this code on a large permit file from the city of Houston may be found at https://github.com/alankjackson/Curated_Data in the file Clean_City_Permit_Addresses.Rmd .
I recommend running the routines in this order.
Match exactly looks for an exact match to the entire address. In my tests this was successful about 90% of the time. It successively looks for failures in the zipcode, street name, street type, street number, and finally the prefix. If any one of these fails, the rest are not checked.
Exact_match <- NULL
Failed_match <- NULL
for (i in 1:nrow(foo)){ # first look for exact matches
tmp <- match_exactly(foo[i,]$pm.house, foo[i,]$pm.preDir, foo[i,]$pm.street,
foo[i,]$pm.streetSuf, foo[i,]$Zipcode)
if (tmp$Success){ # success
Exact_match <- cbind(foo[i,], tmp) %>%
select(pm.id, pm.house, pm.preDir, pm.street, pm.streetSuf, Zipcode, Lat, Lon) %>%
rbind(., Exact_match)
} else { # Fail exact match
Failed_match <- cbind(foo[i,], tmp) %>%
select(pm.id, pm.house, pm.preDir, pm.street, pm.streetSuf, Zipcode, Fail,
Lat, Lon) %>%
rbind(., Failed_match)
}
}
saveRDS(Exact_match, paste0(savepath, "Keep_Exactmatch.rds"))
saveRDS(Failed_match, paste0(savepath, "Keep_Failedmatch.rds"))
Here, I first look to see if the street name exists in the original zip code. If so, then I bail - too many ways that could go bad. If the original zip code does not contain the street name, then I look for an exact match for the address in any zip code. If only one match occurs, then I call that good and return the successful zip code. Note that in all of the repair routines, the proposed correction is returned separately so that it can be eye-balled.
# add a field to hold the corrected data if any
Failed_match <- Failed_match %>% mutate(Correction=NA)
for (i in 1:nrow(Failed_match)){
target <- Failed_match[i,]
tmp <- repair_zipcode(target$pm.house, target$pm.preDir, target$pm.street,
target$pm.streetSuf, target$Zipcode)
if (tmp$Success){ # success
Failed_match[i,]$Lat <- tmp$Lat
Failed_match[i,]$Lon <- tmp$Lon
Failed_match[i,]$Fail <- paste(Failed_match[i,]$Fail, "Zipcode")
Failed_match[i,]$Correction <- tmp$New_zipcode
}
}
Here we will restrict our search to names in the given zip code, and look for names that are close to the target name using the generalized Levenshtein (edit) distance. The default edit distance is 2, but this is a parameter that can be played with. We will also ignore street names that are likely to fail badly, like :
for (i in 1:nrow(Failed_match)){
if (Failed_match[i,]$Fail!="Street_name") {next} # skip if name isn't the issue
if (Failed_match[i,]$Lat > 0) {next} # skip if zipcode resolved it
target <- Failed_match[i,]
tmp <- repair_name(target$pm.house, target$pm.preDir, target$pm.street,
target$pm.streetSuf, target$Zipcode)
if (tmp$Success){ # success
Failed_match[i,]$Lat <- tmp$Lat
Failed_match[i,]$Lon <- tmp$Lon
Failed_match[i,]$Correction <- tmp$New_name
} else {
Failed_match[i,]$Fail <- paste("Street_name",tmp$Fail)
}
}
We will search in the provided zip code for streets that match the name, and see what the available types are. If we find only one, then we are successful and done. If there are more than one, we will look at the full address, and if only one of those matches, we will again declare victory.
We will ignore types that are too likely to fail, "CT", and "CIR".
for (i in 1:nrow(Failed_match)){
if (Failed_match[i,]$Fail!="Street_type") {next} # skip if type isn't the issue
target <- Failed_match[i,]
tmp <- repair_type(target$pm.house, target$pm.preDir, target$pm.street,
target$pm.streetSuf, target$Zipcode)
if (tmp$Success){ # success
Failed_match[i,]$Lat <- tmp$Lat
Failed_match[i,]$Lon <- tmp$Lon
Failed_match[i,]$Correction <- tmp$New_type
} else {
Failed_match[i,]$Fail <- paste("Street_type",tmp$Fail)
}
}
This basically amounts to a careful interpolation. First we gather all the numbers for the given street prefix, name, type, and zip code. We toss any that are more than about 2 blocks away. We honor the odd/even address convention and only look at points that should be on the same side of the street.
If we end up with 3 or more points available to use to interpolate, calculate a (delta lat/long)/(delta house number) and use that to interpolate from the house number nearest the target. As a final check. if that is greater than 250 feet, then reject the interpolation. That reject distance is selectable.
for (i in 1:nrow(Failed_match)){
if (Failed_match[i,]$Fail!="Street_num") {next} # skip if num isn't the issue
target <- Failed_match[i,]
tmp <- repair_number(target$pm.house, target$pm.preDir, target$pm.street,
target$pm.streetSuf, target$Zipcode)
if (tmp$Success){ # success
Failed_match[i,]$Lat <- tmp$Lat
Failed_match[i,]$Lon <- tmp$Lon
Failed_match[i,]$Correction <- tmp$Result
} else {
Failed_match[i,]$Fail <- paste("Street_num",tmp$Fail)
}
}
Similar to repairing type, look for all the possible prefixes that match the zip code, name, and type for the target. If there is only one, then we are done. If there is not just one, then we fail.
for (i in 1:nrow(Failed_match)){
if (Failed_match[i,]$Fail!="Prefix") {next} # skip if name isn't the issue
target <- Failed_match[i,]
tmp <- repair_prefix(target$pm.house, target$pm.preDir, target$pm.street,
target$pm.streetSuf, target$Zipcode)
if (tmp$Success){ # success
Failed_match[i,]$Lat <- tmp$Lat
Failed_match[i,]$Lon <- tmp$Lon
Failed_match[i,]$Correction <- tmp$New_prefix
} else {
Failed_match[i,]$Fail <- paste("Prefix",tmp$Fail)
}
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.