#' @title Get 'observation' table
#' @description Download data from the 'observation' ("observacao") table of one or more datasets
#' published in [Data Repository of the Brazilian Soil](https://www.pedometria.org/febr/).
#' This table includes data such as latitude, longitude, date of observation, underlying geology,
#' land use and vegetation, local topography, soil classification, and much more.
#' @template data_template
#' @template metadata_template
#' @param febr.repo (optional) Defaults to the remote file directory of the Federal University of
#' Technology - Paraná at \url{https://cloud.utfpr.edu.br/index.php/s/Df6dhfzYJ1DDeso}.
#' Alternatively, a local directory path can be informed if the user has a local copy of the data
#' repository.
#' @param missing (optional) List with named sub-arguments indicating what should be done with an
#' observation missing spatial coordinates, `coord`, date of observation, `time`, or data on
#' variables, `data`. Options are `"keep"` (default) and `"drop"`.
#' @param standardization (optional) List with named sub-arguments indicating how to perform data
#' standardization.
#' \itemize{
#' \item `crs` Character string indicating the EPSG code of the coordinate reference system (CRS)
#' to which spatial coordinates should be transformed. For example, `crs = "EPSG:4674"`, i.e.
#' SIRGAS 2000, the standard CRS for Brazil. Defaults to `crs = NULL`, i.e. no transformation
#' is performed.
#' \item `time.format` Character string indicating how to format dates. For example,
#' \code{time.format = "\%d-\%m-\%Y"}, i.e. dd-mm-yyyy such as in 31-12-2001. Defaults to
#' `time.format = NULL`, i.e. no formatting is performed. See [base::as.Date()] for more
#' details.
#'
#' \item `units` Logical value indicating if the measurement unit(s) of the continuous variable(s)
#' should be converted to the standard measurement unit(s). Defaults to `units = FALSE`, i.e. no
#' conversion is performed. See [febr::dictionary()] for more information.
#'
#' \item `round` Logical value indicating if the values of the continuous variable(s) should be
#' rounded to the standard number of decimal places. Requires `units = TRUE`. Defaults to
#' `round = FALSE`, i.e. no rounding is performed. See [febr::dictionary()] for more
#' information.
#' }
#'
#' @param harmonization (optional) List with named sub-arguments indicating if and how to perform
#' data harmonization.
#' \itemize{
#' \item `harmonize` Logical value indicating if data should be harmonized. Defaults to
#' `harmonize = FALSE`, i.e. no harmonization is performed.
#'
#' \item `level` Integer value indicating the number of levels of the identification code of the
#' variable(s) that should be considered for harmonization. Defaults to `level = 2`. See
#' \sQuote{Details} for more information.
#' }
#' @details
#' \subsection{Default variables}{
#' Default variables (fields) present in the 'observation' table are as follows:
#' \itemize{
#' \item `dataset_id`. Identification code of the dataset in the FEBR to which an observation
#' belongs.
#' \item `evento_id_febr`. Identification code of an observation in a dataset.
#' \item `evento_data`. Date (dd-mm-yyyy) in which an observation was made.
#' \item `coord_datum`. EPSG code of the coordinate reference system.
#' \item `coord_longitude`. Longitude (deg) or easting (m).
#' \item `coord_latitude`. Latitude (deg) or northing (m).
#' \item `coord_precisao`. Precision with which the spatial coordinates were determined (m).
#' \item `coord_fonte`. Source of the spatial coordinates.
#' \item `pais_id`. Code (ISO 3166-1 alpha-2) of the county where an observation was made.
#' \item `estado_sigla`. Acronym of the Brazilian federative unit where an observation was made.
#' \item `municipio_nome`. Name of the Brazilian municipality where as observation was made.
#' \item `subamostra_quanti`. Number of sub samples taken (used to indicate composite sampling).
#' \item `amostra_area`. Sampling area (used to indicate areal or block sampling).
#' }
#' Further details about the content of the default variables (fields) can be found in
#' \url{https://docs.google.com/document/d/1Bqo8HtitZv11TXzTviVq2bI5dE6_t_fJt0HE-l3IMqM}
#' (in Portuguese).
#' }
#' \subsection{Harmonization}{
#' Data harmonization consists of converting the values of a variable determined using some method
#' *B* so that they are (approximately) equivalent to the values that would have been obtained if
#' the standard method *A* had been used instead. For example, converting carbon content values
#' obtained using a wet combustion method to the standard dry combustion method is data
#' harmonization.
#'
#' A heuristic data harmonization procedure is implemented in the __febr__ package. It consists of
#' grouping variables based on a chosen number of levels of their identification code. For example,
#' consider a variable with an identification code composed of four levels, `aaa_bbb_ccc_ddd`, where
#' `aaa` is the first level and `ddd` is the fourth level. Now consider a related variable,
#' `aaa_bbb_eee_fff`. If the harmonization is to consider all four coding levels (`level = 4`),
#' then these two variables will remain coded as separate variables. But if `level = 2`, then both
#' variables will be re-coded as `aaa_bbb`, thus becoming the same variable.
#' }
#'
#' @return A `list` of `data.frame`s or a `data.frame` with, possibly standardize or harmonized,
#' data of the chosen variable(s) of the chosen dataset(s).
#' @author Alessandro Samuel-Rosa \email{alessandrosamuelrosa@@gmail.com}
#' @seealso [febr::readFEBR()], [febr::layer()], [febr::dictionary()], [febr::unit()]
#' @export
#' @examples
#' if (interactive()) {
#' res <- observation(data.set = "ctb0013")
#'
#' # Download two data sets and standardize CRS
#' res <- observation(
#' data.set = paste("ctb000", 4:5, sep = ""),
#' variable = "taxon",
#' standardization = list(crs = "EPSG:4674"))
#'
#' # Try to download a data set that is not available yet
#' res <- observation(data.set = "ctb0020")
#'
#' # Try to download a non existing data set
#' #res <- observation(data.set = "ctb0000")
#'
# Try to read all files from local directory
# febr.repo <- "~/ownCloud/febr-repo/publico"
# febr.repo <- ifelse(dir.exists(febr.repo), febr.repo, NULL)
# res <- observation(data.set = "all", febr.repo = febr.repo)
#' }
####################################################################################################
observation <-
function(data.set, variable, stack = FALSE,
missing = list(coord = "keep", time = "keep", data = "keep"),
standardization = list(crs = NULL, time.format = NULL, units = FALSE, round = FALSE),
harmonization = list(harmonize = FALSE, level = 2),
progress = TRUE, verbose = TRUE, febr.repo = NULL) {
# OPÇÕES E PADRÕES
opts <- .opt()
std_cols <- opts$observation$std.cols()
dic <- dictionary(table = "observacao", active = TRUE)
#
# ARGUMENT CHECK ----
## data.set
if (missing(data.set)) {
stop("Argument 'data.set' is missing")
} else if (!is.character(data.set)) {
stop(paste0("Object of class ", class(data.set), " passed to 'data.set'"))
} else {
dataset_ids <- readIndex()[["dados_id"]]
if (data.set[1] != "all") {
idx_out <- data.set %in% dataset_ids
if (sum(idx_out) != length(data.set)) {
stop(paste0("Unknown value '", data.set[!idx_out], "' passed to 'data.set'"))
} else {
dataset_ids <- data.set
}
}
}
n_datasets <- length(dataset_ids)
## variable
if (!missing(variable) && !is.character(variable)) {
stop(paste0("object of class '", class(variable), "' passed to 'variable'"))
}
## stack
if (!is.logical(stack)) {
stop(paste0("object of class '", class(stack), "' passed to 'stack'"))
}
## missing
if (!missing(missing)) {
if (is.null(missing$coord)) {
missing$coord <- "keep"
} else if (!missing$coord %in% c("drop", "keep")) {
stop(paste0("unknown value '", missing$coord, "' passed to 'missing$coord'"))
}
if (is.null(missing$time)) {
missing$time <- "keep"
} else if (!missing$time %in% c("drop", "keep")) {
stop(paste0("unknown value '", missing$time, "' passed to 'missing$time'"))
}
if (is.null(missing$data)) {
missing$data <- "keep"
} else if (!missing$data %in% c("drop", "keep")) {
stop(paste0("unknown value '", missing$data, "' passed to 'missing$data'"))
}
}
## standardization
if (!missing(standardization)) {
if (is.null(standardization$crs)) {
standardization$crs <- NULL
} else if (!is.character(standardization$crs)) {
y <- class(standardization$crs)
stop(paste0("object of class '", y, "' passed to 'standardization$crs'"))
} else if (!toupper(standardization$crs) %in% opts$crs) {
y <- standardization$crs
stop(paste0("unknown value '", y, "' passed to 'standardization$crs'"))
}
if (is.null(standardization$time.format)) {
standardization$time.format <- NULL
} else if (!is.character(standardization$time.format)) {
y <- class(standardization$time.format)
stop(paste0("object of class '", y, "' passed to 'standardization$time.format'"))
}
if (is.null(standardization$units)) {
standardization$units <- FALSE
} else if (!is.logical(standardization$units)) {
y <- class(standardization$units)
stop(paste0("object of class '", y, "' passed to 'standardization$units'"))
}
if (is.null(standardization$round)) {
standardization$round <- FALSE
} else if (!is.logical(standardization$round)) {
y <- class(standardization$round)
stop(paste0("object of class '", y, "' passed to 'standardization$round'"))
}
if (is.null(standardization$units)) {
standardization$units <- FALSE
} else if (!is.logical(standardization$units)) {
y <- class(standardization$units)
stop(paste0("object of class '", y, "' passed to 'standardization$units'"))
}
if (is.null(standardization$round)) {
standardization$round <- FALSE
} else if (!is.logical(standardization$round)) {
y <- class(standardization$round)
stop(paste0("object of class '", y, "' passed to 'standardization$round'"))
}
}
## harmonization
if (!missing(harmonization)) {
if (is.null(harmonization$harmonize)) {
harmonization$harmonize <- FALSE
} else if (!is.logical(harmonization$harmonize)) {
y <- class(harmonization$harmonize)
stop(paste0("object of class '", y, "' passed to 'harmonization$harmonize'"))
}
if (is.null(harmonization$level)) {
harmonization$level <- 2
} else if (!.isNumint(harmonization$level)) {
y <- class(harmonization$level)
stop(paste0("object of class '", y, "' passed to 'harmonization$level'"))
}
}
## progress
if (!is.logical(progress)) {
stop(paste0("object of class '", class(progress), "' passed to 'progress'"))
}
## verbose
if (!is.logical(verbose)) {
stop(paste0("object of class '", class(verbose), "' passed to 'verbose'"))
}
## variable + stack || variable + harmonization
if (!missing(variable) && all(variable == "all")) {
if (stack) {
stop("data cannot be stacked when downloading all variables")
}
if (harmonization$harmonize) {
stop("data cannot be harmonized when downloading all variables")
}
}
## data.set + stack
if (stack && length(data.set) == 1 && data.set != "all") {
message("A single dataset is being downloaded... setting stack = FALSE")
stack <- FALSE
}
# PADRÕES
## Descarregar tabela com unidades de medida e número de casas decimais quando padronização é
## solicitada ou quando empilhamento é solicitado
if (standardization$units || stack) {
febr_stds <- .getStds()
febr_unit <- .readGoogleSheetCSV(sheet.name = "unidades")
}
## stack + stadardization
## Padronização não precisa ser feita no caso de descarregamento apenas das variáveis padrão
## Também não precisa ser feita no caso de variáveis de tipo 'texto'
if (stack && !standardization$units && !missing(variable) && variable != "all") {
tmp_var <- paste("^", variable, sep = "")
idx <- lapply(tmp_var, function(pattern) grep(pattern = pattern, x = febr_stds$campo_id))
idx <- unlist(idx)
is_all_text <- all(febr_stds$campo_tipo[idx] == "texto")
if (!is_all_text) {
stop("Data cannot be stacked when measurement units are not standardized")
}
}
# Descarregar planilhas com observações
if (progress) {
pb <- utils::txtProgressBar(min = 0, max = n_datasets, style = 3)
}
res <- list()
for (i in seq_along(dataset_ids)) {
# Informative messages
dts <- dataset_ids[i]
if (verbose) {
par <- ifelse(progress, "\n", "")
message(paste(par, "Reading ", dts, "-observacao...", sep = ""))
}
# DESCARREGAMENTO
tmp <- .readFEBR(
data.set = dataset_ids[i], data.table = "observacao", febr.repo = febr.repo)
if (inherits(tmp, "data.frame")) {
unit <- .readFEBR(
data.set = dataset_ids[i], data.table = "metadado", febr.repo = febr.repo)
unit$campo_unidade[is.na(unit$campo_unidade)] <- "-"
unit <- unit[unit$tabela_id == "observacao", c("campo_id", "campo_nome", "campo_unidade")]
unit <- as.data.frame(t(unit), stringsAsFactors = FALSE)
colnames(unit) <- unlist(unit[1, ])
unit <- unit[-1, ]
n_rows <- nrow(tmp)
# PADRONIZAÇÀO/ATUALIZAÇÃO DOS NOMES DE TODOS OS CAMPOS
# * tmp: tabela de dados
# * unit: tabela de unidades de medida
in_cols <- colnames(tmp)
test_oldid_colnames <- dic[["campo_oldid"]] %in% in_cols
if (any(test_oldid_colnames)) {
cross_colnames <- dic[which(test_oldid_colnames), ]
data.table::setnames(tmp, cross_colnames[["campo_oldid"]], cross_colnames[["campo_id"]])
in_cols <- colnames(tmp)
data.table::setnames(unit, cross_colnames[["campo_oldid"]], cross_colnames[["campo_id"]])
}
# Nomes das colunas usadas para armazenar coordenadas espaciais
coord_names <- c("coord_longitude", "coord_latitude")
# PROCESSAMENTO I
## A decisão pelo processamento dos dados começa pela verificação de dados faltantes nas
## coordenadas e na data.
na_coord <- max(apply(tmp[, coord_names], 2, function(x) sum(is.na(x))))
na_time <- is.na(tmp$evento_data)
n_na_time <- sum(na_time)
if (missing$coord == "keep" && missing$time == "keep" ||
missing$coord == "drop" && na_coord < n_rows && missing$time == "keep" |
missing$time == "drop" ||
missing$coord == "keep" | missing$coord == "drop" && missing$time == "drop" &&
n_na_time < n_rows) {
# COLUNAS
## Definir as colunas a serem mantidas para além das colunas obrigatórias
## É possível que algumas colunas adicionais não contenham quaisquer dados,
## sendo ocupadas por 'NA': tais colunas são descartadas da tabela de dados.
cols <- in_cols[in_cols %in% std_cols[["campo_id"]]]
extra_cols <- vector()
if (!missing(variable)) {
if (length(variable) == 1 && variable == "all") {
extra_cols <- in_cols[!in_cols %in% std_cols[["campo_id"]]]
idx_na <- apply(tmp[extra_cols], 2, function(x) all(is.na(x)))
extra_cols <- extra_cols[!idx_na]
} else {
extra_cols <- lapply(variable, function(x) {
in_cols[grep(paste("^", x, sep = ""), in_cols)]
})
extra_cols <- unlist(extra_cols)
extra_cols <- extra_cols[!extra_cols %in% std_cols[["campo_id"]]]
idx_na <- apply(tmp[extra_cols], 2, function(x) all(is.na(x)))
extra_cols <- extra_cols[!idx_na]
}
}
cols <- c(cols, extra_cols)
tmp <- tmp[cols]
unit <- unit[cols]
# LINHAS I
## Avaliar limpeza das linhas
tmp_clean <- .cleanRows(obj = tmp, missing = missing, extra_cols = extra_cols,
coord.names = coord_names)
n_rows <- nrow(tmp_clean)
# PROCESSAMENTO II
## A continuação do processamento dos dados depende das presença de dados após a eliminação
## de colunas e linhas com NAs.
if (n_rows >= 1 && missing(variable) || missing$data == "keep") {
# LINHAS II
## Definir as linhas a serem mantidas
## É preciso considerar todas as possibilidades de remoção de dados
if (missing$data == "drop" || missing$coord == 'drop' || missing$time == 'drop') {
tmp <- tmp_clean
}
# TIPO DE DADOS
## 'evento_id_febr', 'sisb_id' e 'ibge_id' precisam estar no formato de caracter para
## evitar erros durante o empilhamento das tabelas devido ao tipo de dado.
## Nota: esse processamento deve ser feito via Google Sheets.
tmp$evento_id_febr <- as.character(tmp$evento_id_febr)
if ("sisb_id" %in% colnames(tmp)) {
tmp$sisb_id <- as.character(tmp$sisb_id)
}
if ("ibge_id" %in% colnames(tmp)) {
tmp$ibge_id <- as.character(tmp$ibge_id)
}
# 'coord_precisao' precisa estar no formato numérico ao invés de inteiro
if ("coord_precisao" %in% colnames(tmp)) {
tmp$coord_precisao <- as.numeric(tmp$coord_precisao)
}
# PADRONIZAÇÃO I
## Sistema de referência de coordenadas
## Primeiro verificar se existem observações com coordenadas e se o SRC deve ser
## transformado
na_coord <- max(apply(tmp[, coord_names], 2, function(x) sum(is.na(x))))
if (n_rows > na_coord && !is.null(standardization$crs)) {
tmp <- .crsTransform(obj = tmp, crs = standardization$crs, coord.names = coord_names)
}
# PADRONIZAÇÃO II
## Data de observação
if (n_rows > n_na_time && !is.null(standardization$time.format)) {
# if (n_rows > na_time && !is.null(standardization$time.format)) {
tmp <- .formatObservationDate(obj = tmp, time.format = standardization$time.format)
}
# PADRONIZAÇÃO III
## Unidade de medida e número de casas decimais das colunas adicionais
if (standardization$units && length(extra_cols) >= 1) {
## Identificar variáveis contínuas (classe 'numeric' e 'integer'), excluíndo variáveis
## de identificação padrão
## TODO: EXCETO 'coord_precisao'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
id_class <- sapply(tmp, class)
cont_idx <- which(id_class %in% c("numeric", "integer") &
!names(id_class) %in% std_cols[["campo_id"]])
if (length(cont_idx) >= 1) {
# Tabela com padrões das variáveis contínuas identificadas
tmp_stds <- match(cols[cont_idx], febr_stds$campo_id)
tmp_stds <- febr_stds[tmp_stds, c("campo_id", "campo_unidade", "campo_precisao")]
## 1. Se necessário, padronizar unidades de medida
## # verifica a 2ª linha de metadados
need_idx <- unit[2, cols[cont_idx]] != tmp_stds$campo_unidade
if (any(need_idx)) {
need_name <- cols[cont_idx][need_idx]
source <- unit[2, need_name]
target <- tmp_stds$campo_unidade[match(need_name, tmp_stds$campo_id)]
## Identificar constante
k <- lapply(seq_along(source), function(i) {
# i <- 2
idx <- febr_unit$unidade_origem %in% source[i] +
febr_unit$unidade_destino %in% target[i]
febr_unit[idx == 2, ]
})
k <- do.call(rbind, k)
## Processar dados
tmp[need_name] <- mapply(`*`, tmp[need_name], k$unidade_constante)
unit[2, need_name] <- k$unidade_destino
}
## 2. Se necessário, padronizar número de casas decimais
if (standardization$round) {
tmp[tmp_stds$campo_id] <-
sapply(seq(nrow(tmp_stds)), function(i)
round(x = tmp[tmp_stds$campo_id[i]], digits = tmp_stds$campo_precisao[i]))
}
}
}
# ATTRIBUTOS I
## Processar unidades de medida
unit[2, ] <- as.character(unit[2, names(unit) %in% cols])
unit[2, ] <- gsub("^-$", "unitless", unit[2, ])
# https://en.wikipedia.org/wiki/List_of_Unicode_characters
unit["evento_id_febr"] <- c("Identifica\u00E7\u00E3o da observa\u00E7\u00E3o", "unitless")
dataset_id <- c("Identifica\u00E7\u00E3o do conjunto de dados", "unitless")
unit <- cbind(dataset_id, unit)
# HARMONIZAÇÃO I
## Harmonização dos dados das colunas adicionais
if (harmonization$harmonize && length(extra_cols) >= 1) {
## Harmonização baseada nos níveis dos códigos de identificação
tmp <- .harmonizeByName(obj = tmp, extra_cols = extra_cols,
harmonization = harmonization)
}
# IDENTIFICAÇÃO
## Código de identificação do conjunto de dados
res[[i]] <- cbind(dataset_id = dataset_ids[i], tmp, stringsAsFactors = FALSE)
# ATTRIBUTOS II
a <- attributes(res[[i]])
## Adicionar nomes reais
a$field_name <- as.vector(t(unit)[, 1])
## Adicionar unidades de medida
a$field_unit <- as.vector(t(unit)[, 2])
attributes(res[[i]]) <- a
if (progress) {
utils::setTxtProgressBar(pb, i)
}
} else {
res[[i]] <- data.frame()
m <- paste("All observations in {dts} are missing data. None will be returned.")
message(m)
}
} else {
res[[i]] <- data.frame()
if (na_coord == n_rows) {
m <-
paste0("All observations in", dts, "are missing coordinates. None will be returned.")
} else if (n_na_time == n_rows) {
m <- paste("All observations in", dts, "are missing date. None will be returned.")
}
message(m)
}
} else {
res[[i]] <- tmp
}
}
if (progress) {
close(pb)
}
# FINAL
## Empilhar conjuntos de dados
## Adicionar unidades de medida
if (stack) {
res <- .stackTables(obj = res)
} else if (n_datasets == 1 & inherits(res, "list")) {
res <- res[[1]]
} else {
names(res) <- dataset_ids
}
return(res)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.