coords: Coordinate based cleaning

coords R Documentation

Coordinate based cleaning

Description

Coordinate based cleaning

Usage

coord_incomplete(x, lat = NULL, lon = NULL, drop = TRUE)

coord_imprecise(x, which = "both", lat = NULL, lon = NULL, drop = TRUE)

coord_impossible(x, lat = NULL, lon = NULL, drop = TRUE)

coord_unlikely(x, lat = NULL, lon = NULL, drop = TRUE)

coord_within(
  x,
  field = NULL,
  country = NULL,
  lat = NULL,
  lon = NULL,
  drop = TRUE
)

coord_pol_centroids(x, lat = NULL, lon = NULL, drop = TRUE)

coord_uncertain(
  x,
  coorduncertainityLimit = 30000,
  drop = TRUE,
  ignore.na = FALSE
)

Arguments

x

(data.frame) A data.frame

lat, lon

(character) Names of the latitude and longitude columns to use. See Details.

drop

(logical) Whether to drop bad data points. Either way, the bad data points are stored as an attribute that you can access. Default: TRUE

which

(character) One of "has_dec", "no_zeros", or "both" (default)

field

(character) Name of the field in the input data.frame x that holds country names

country

(character) A single country name

coorduncertainityLimit

(numeric) Threshold for the coordinateUncertaintyInMeters variable. Default: 30000

ignore.na

(logical) Whether to consider NA values as bad points. Default: FALSE

Details

Explanation of the functions:

  • coord_impossible - Impossible coordinates

  • coord_incomplete - Incomplete coordinates

  • coord_imprecise - Imprecise coordinates

  • coord_pol_centroids - Points at political centroids

  • coord_unlikely - Unlikely coordinates

  • coord_within - Filter points within user input political boundaries

  • coord_uncertain - Uncertain occurrences, as measured through coordinateUncertaintyInMeters (default limit: 30000)
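To make the first two checks in the list concrete, here is a minimal base R sketch. It assumes that "impossible" means a latitude outside [-90, 90] or a longitude outside [-180, 180], and that "incomplete" means a missing latitude or longitude. The helpers flag_incomplete and flag_impossible are hypothetical names used for illustration only; they are not functions from this package and may not match its internals.

```r
# Hypothetical helpers (NOT scrubr functions): flag rows by coordinate validity.
# Assumption: the columns are already named "latitude" and "longitude".
flag_incomplete <- function(x) {
  # A record is incomplete when either coordinate is missing
  is.na(x$latitude) | is.na(x$longitude)
}

flag_impossible <- function(x) {
  # Assumed definition: outside the valid range of geographic coordinates
  bad <- abs(x$latitude) > 90 | abs(x$longitude) > 180
  # Missing values are left to flag_incomplete(), so they are not flagged here
  bad[is.na(bad)] <- FALSE
  bad
}

pts <- data.frame(
  name = c("a", "b", "c", "d"),
  longitude = c(10.5, 200, NA, -75.2),
  latitude = c(45.1, 12.3, 33.3, 170)
)

flag_incomplete(pts)
flag_impossible(pts)

# Keep only rows that pass both checks
clean <- pts[!(flag_incomplete(pts) | flag_impossible(pts)), , drop = FALSE]
clean
```

In this toy data, only record "a" passes both checks: "b" has a longitude of 200, "c" has a missing longitude, and "d" has a latitude of 170.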

If either lat or lon (or both) is given, we rename the given columns to the standardized names "latitude" and "longitude". If they are not given, we attempt to guess which columns hold latitude and longitude and rename them in the same way. Working with standardized names makes downstream processing easier because every function deals with consistent column names. The original column names are restored when the data is returned.
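The rename-then-restore pattern described above can be sketched in base R as follows. This is only an illustration under stated assumptions: with_standard_coords is a hypothetical helper, not part of the package, and it assumes the cleaning step keeps all columns in their original order.

```r
# Hypothetical helper (NOT a scrubr internal): run a cleaning step `f` on a
# copy of `x` whose coordinate columns use the standard names, then restore
# the user's original column names on the result.
with_standard_coords <- function(x, lat, lon, f) {
  stopifnot(lat %in% names(x), lon %in% names(x))
  orig <- names(x)
  names(x)[orig == lat] <- "latitude"
  names(x)[orig == lon] <- "longitude"
  out <- f(x)
  # Assumption: f() returns the same columns in the same order
  names(out) <- orig
  out
}

obs <- data.frame(
  name = c("a", "b"),
  mylon = c(10, 20),
  mylat = c(50, 170)
)

# The cleaning step only ever sees "latitude" and "longitude"
res <- with_standard_coords(
  obs, lat = "mylat", lon = "mylon",
  f = function(d) d[abs(d$latitude) <= 90, , drop = FALSE]
)

names(res)
nrow(res)
```

The result carries the caller's names (name, mylon, mylat) even though the filter was written against the standardized names.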

For coord_within, we use the countriesLow dataset from the rworldmap package to get country borders.

Value

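Returns a data.frame. The records flagged as bad are attached as an attribute, which the Examples access by function name (e.g., attr(x, "coord_impossible")).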
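As a rough illustration of how flagged records can travel with the result as an attribute, consider the base R sketch below. The helper split_bad is hypothetical and is not the package's implementation; it only mirrors the documented behavior that bad records are stored as an attribute whether or not drop is TRUE.

```r
# Hypothetical helper (NOT scrubr code): attach flagged rows as an attribute
# and optionally drop them from the returned data.frame.
split_bad <- function(x, bad, name, drop = TRUE) {
  removed <- x[bad, , drop = FALSE]
  if (drop) {
    x <- x[!bad, , drop = FALSE]
  }
  # Stored under `name` either way, so the caller can inspect what was flagged
  attr(x, name) <- removed
  x
}

obs <- data.frame(
  name = c("a", "b", "c"),
  longitude = c(10, 20, 30),
  latitude = c(50, 170, -20)
)
bad <- abs(obs$latitude) > 90

dropped <- split_bad(obs, bad, "coord_impossible", drop = TRUE)
kept <- split_bad(obs, bad, "coord_impossible", drop = FALSE)

NROW(dropped)
NROW(kept)
attr(dropped, "coord_impossible")
```

With drop = TRUE the flagged row is removed from the data; with drop = FALSE it stays. In both cases the same row is available through attr().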

coord_pol_centroids

Right now, this function only deals with city centroids, using the maps::world.cities dataset of more than 40,000 cities. We'll work on adding country centroids, and perhaps others (e.g., counties, states, provinces, parks, etc.).

Examples

df <- sample_data_1

# Remove impossible coordinates
NROW(df)
df[1, "latitude"] <- 170
df <- dframe(df) %>% coord_impossible()
NROW(df)
attr(df, "coord_impossible")

# Remove incomplete cases
NROW(df)
df_inc <- dframe(df) %>% coord_incomplete()
NROW(df_inc)
attr(df_inc, "coord_incomplete")


# Remove imprecise cases
df <- sample_data_5
NROW(df)
## remove records that don't have decimals at all
df_imp <- dframe(df) %>% coord_imprecise(which = "has_dec")
NROW(df_imp)
attr(df_imp, "coord_imprecise")
## remove records that have all zeros
df_imp <- dframe(df) %>% coord_imprecise(which = "no_zeros")
NROW(df_imp)
attr(df_imp, "coord_imprecise")
## remove both records that don't have decimals at all and those that
## have all zeros
df_imp <- dframe(df) %>% coord_imprecise(which = "both")
NROW(df_imp)
attr(df_imp, "coord_imprecise")


# Remove unlikely points
NROW(df)
df_unlikely <- dframe(df) %>% coord_unlikely()
NROW(df_unlikely)
attr(df_unlikely, "coord_unlikely")

# Remove points not within correct political borders
if (requireNamespace("rgbif", quietly = TRUE) && interactive()) {
  library("rgbif")
  wkt <- 'POLYGON((30.1 10.1,40 40,20 40,10 20,30.1 10.1))'
  res <- rgbif::occ_data(geometry = wkt, limit = 300)$data
} else {
  res <- sample_data_4
}

## By specific country name
if (
  interactive() &&
  requireNamespace("sf", quietly = TRUE) &&
  requireNamespace("s2", quietly = TRUE) &&
  requireNamespace("rworldmap", quietly = TRUE)
) {
  NROW(res)
  df_within <- dframe(res) %>% coord_within(country = "Israel")
  NROW(df_within)
  attr(df_within, "coord_within")

  ## By a field in your data - makes sure your points occur in one
  ## of those countries
  NROW(res)
  df_within <- dframe(res) %>% coord_within(field = "country")
  NROW(df_within)
  head(df_within)
  attr(df_within, "coord_within")
}

# Remove those very near political centroids
## not ready yet
# NROW(df)
# df_polcent <- dframe(df) %>% coord_pol_centroids()
# NROW(df_polcent)
# attr(df_polcent, "coord_polcent")

## lat/long column names can vary
df <- sample_data_1
head(df)
names(df)[2:3] <- c('mylon', 'mylat')
head(df)
df[1, "mylat"] <- 170
dframe(df) %>% coord_impossible(lat = "mylat", lon = "mylon")

df <- sample_data_6

# Remove uncertain occurrences

NROW(df)
df1 <- df %>% coord_uncertain()
NROW(df1)
## the removed records are attached to the result, not to the input
attr(df1, "coord_uncertain")

NROW(df)
df2 <- df %>% coord_uncertain(coorduncertainityLimit = 20000)
NROW(df2)

NROW(df)
df3 <- df %>% coord_uncertain(coorduncertainityLimit = 20000, ignore.na = TRUE)
NROW(df3)

ropenscilabs/scrubr documentation built on Sept. 12, 2022, 4:10 p.m.