knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
library(envReport) library(envImport) library(magrittr)
The goal of envImport is to obtain, and make seamlessly useable, environmental data from disparate data sources, for a geographic area of interest.
You can install the development version of envImport from GitHub with:
# install.packages("devtools") devtools::install_github("dew-landscapes/envImport")
data_name
= 'data source'. Data sources are (usually) obvious sources of data. Examples are the Global Biodiversity Infrastructure Facility (GBIF), Atlas of Living Australia (ALA) or Terrestrial Ecosystems Network (TERN). There are r ncol(envImport::data_map) - 3
data sources currently supported (also see envImport::data_map
):
r paste0(names(data_map)[4:ncol(data_map)], ": ", data_map[data_map$col == "desc", 4:ncol(data_map)], collapse = "\n* ")
Five of these sources are publicly available (GBIF, ALA, OBIS, HAVPlot and TERN).
data_map
The data_map (see table below) provides a mapping from original data sources to the desired columns in the assembled data set.
dm <- data_map %>% dplyr::select(col, gbif, tern, galah, havplot) %>% dplyr::filter() knitr::kable(dm , caption = "Data map of desired columns in the assembled data (col) and names of columns in the original data. Where a column name from the original data source does not match columns in the original data source, the get_x function has usually created a new column to better meet the requirements of the final combined data set" )
get_x
get_x
functions get data from the data source x
. Results are always saved to disk (as getting data can be slow). When run again, they load from the saved file by default. If available, get_x
functions use any R packages and functions provided by the data source (e.g. TERN provides ausplotsR
r cite_package("ausplotsR")
). The first arguments to get_x
functions are always:
aoi
: an area of interest, provided as simple feature (see sf::sf()
)save_dir
: a directory to save the results to. The default (NULL
) leads to the file here::here("out", "ds", "x.rds")
being created and used as save_file
. ds
is for 'data source'. While the saved file is usually x.rds
, in some instances it follows the format and naming of the download from x
(e.g. GBIF data comes in a .zip
file named by the corresponding download key)get_new
: an override to force get_x
to requery the data source, even if save_file already exists...
: the dots are passed to any underlying 'native' function, such as rgbif::occ_download()
, galah::galah_call()
or ausplotsR::get_ausplots()
Only the get_x
functions for publicly available data are available within envImport.
Within get_x
functions the following steps are taken:
occ
(is this a presence [1
] or absence [0
] record?), month
and year
get_x
functions can be run from get_data
.
No specific functions are provided for combining data. The following are possible (assuming 'files' is a vector of file names resulting from get_x
):
purrr::map_dfr(files, \(x) rio::import(x, setclass = "tibble")
arrow::open_dataset(files, unify_schema = TRUE) %>% dplyr::collect()
rio::import
is possibly more robust to differences in schema when importing files (based on observation - needs testing).
envImport
does not clean data. Any combined dataset is likely to contain all sorts of duplication and other spurious records. For help cleaning data, see, for example:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.