The ASCII files provided by the EPA contain all data required for building the
local database (download_ecotox_data()). As documented by the EPA, most
table fields are stored as text (with a few exceptions). During the build
process, all fields are kept as is, without any cleanup or standardisation.
This is done to avoid any data loss or corruption and keep it as close
to the source as possible. Therefore, it is likely that you need to post-process
data after querying the locally built database.
[Figure: ECOTOXr workflow diagram (ecotox-workflow.svg)]
Although it is the user's responsibility to evaluate the correctness and
validity of the data, the ECOTOXr package provides some tools to make the
cleanup process easier. This vignette presents important aspects of cleaning
units, numerics, and dates.
In general there are two types of sanitising functions: those named
as_..._ecotox() and those starting with process_ecotox_...s(), where '...' is
the data type being sanitised (e.g. unit, numeric, or date).
The first function type ('as') handles character vectors. The second
function type ('process') handles data.frames, where relevant columns are
automatically detected and processed with the 'as' type functions.
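The relationship between the two function types can be sketched in base R. The helper names below (as_clean_example() and process_examples()) are hypothetical and do not exist in ECOTOXr. They only illustrate the pattern of a vector-level function wrapped by a data.frame-level function that detects columns by name.

```r
# Hypothetical vector-level cleaner (not part of ECOTOXr):
# trims surrounding whitespace from a character vector.
as_clean_example <- function(x) {
  stopifnot(is.character(x))
  trimws(x)
}

# Hypothetical data.frame-level wrapper (not part of ECOTOXr):
# applies the cleaner to every column whose name ends with `suffix`.
process_examples <- function(df, suffix = "_unit") {
  target <- grepl(paste0(suffix, "$"), names(df))
  df[target] <- lapply(df[target], as_clean_example)
  df
}

raw <- data.frame(
  conc1_unit = c(" mg/L", "ug/L "),
  species    = c(" Daphnia magna", "Danio rerio"),
  stringsAsFactors = FALSE
)

cleaned <- process_examples(raw)
cleaned$conc1_unit
```

Only the column matching the suffix is touched; all other columns pass through unchanged.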
Note that the sanitation routines are subject to development, so they may change.
For reproducible results you should therefore always report the version of
ECOTOXr you are using (cite_ecotox()).
Quantity units are vital for the interpretation of measurements. The ECOTOX database
contains units as reported by its source publication. As a result, the units are often
not stored consistently and are not standardised. The ECOTOXr package implements
a function that sanitises the unit text fields and then parses them with the
units package. This package provides instruments to convert units using the
UNIDATA udunits library.
The advantage of using the units package is that it provides a mechanism
to apply arithmetic manipulations of data and conversion between compatible
units. Or as the documentation of the package puts it:
"When used in expression, it automatically converts units, and simplifies units of
results when possible; in case of incompatible units, errors are raised"
So the goal of the sanitation steps here is to create a format that can be parsed
by the units package. In order to achieve this, the following steps are performed:
[...] concentration.
[...] units package are converted such that they are handled correctly. For
instance, 'C' in the ECOTOX database frequently stands for 'degrees Celsius'
(although it is also used to indicate carbon). So if it is not an annotation,
'C' is replaced with 'Celsius'. Another example is 'sqft' (square feet), which
cannot be interpreted by the units package and is replaced with 'ft2'.
[...] units::mixed_units(1, "unit")).
The documentation of the as_unit_ecotox() function has a more detailed description
of the cleanup procedure. If you need even more details you can check the
source code.
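The two replacements named above ('sqft' and 'C') can be sketched with plain regular expressions. This is a much simplified illustration, not the ECOTOXr implementation: sanitise_unit_sketch() is a hypothetical helper, and it only handles a stand-alone 'C', whereas the package also has to decide whether 'C' is an annotation.

```r
# Hypothetical, simplified unit cleaner (not the ECOTOXr implementation).
sanitise_unit_sketch <- function(x) {
  x <- trimws(x)
  # 'sqft' (square feet) is rewritten as 'ft2'.
  x <- gsub("\\bsqft\\b", "ft2", x, perl = TRUE)
  # A unit consisting solely of 'C' is assumed to mean degrees Celsius.
  x <- gsub("^C$", "Celsius", x)
  x
}

sanitised <- sanitise_unit_sketch(
  c(" sqft", "C", "mg/L", "fl oz/10 gal/1k sqft")
)
sanitised
```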
In order to demonstrate how unit sanitation works in this package, let's first initialise a vector of mishmash units. These are actually a random sample from the ECOTOX database, not necessarily the most common ones:
library(ECOTOXr) |> suppressMessages()
library(dplyr)   |> suppressMessages()

mishmash <- c(
  "ppm-d", "ml/2.5 cm eu", "fl oz/10 gal/1k sqft", "kg/100 L", "mopm",
  "ng/kg", "ug", "AI ng/g", "PH", "pm", "uM/cm3", "1e-4 mM", "degree",
  "fs", "mg/TI", "RR", "ug/g org/d", "1e+4 IU/TI", "pg/mg TE", "pmol/mg",
  "1e-9/l", "no >15 cm", "umol/mg pro", "cc/org/wk", "PIg/L",
  "ug/100 ul/org", "ae mg/kg diet/d", "umol/mg/h", "cmol/kg d soil",
  "ug/L diet", "kg/100 kg sd", "1e+6 cells", "ul diet", "S",
  "mmol/h/g TI", "g/70 d", "vg", "ng/200 mg diet", "uS/cm2", "AI ml/ha",
  "AI pt/acre", "mg P/h/g TI", "no/m", "kg/ton sd", "ug/g wet wt",
  "AI mg/2 L diet", "nmol/TI", "umol/g wet wt", "PSU", "Wijs number")
With as_unit_ecotox(), the mishmash of units, represented by character strings,
is cleaned and coerced to units::mixed_units(). As units objects have a numeric
component, but the character strings from the database do not, each unit is given
a value of 1. As you can see, not all units in the mishmash vector can be
interpreted; those are just returned as arbitrary 1 unit.
as_unit_ecotox(mishmash, warn = FALSE)
With process_ecotox_units() you can process an entire data.frame/tibble, where
each column ending with _unit is processed (i.e. as_unit_ecotox() is called on them).
By setting the .names argument, you can preserve the original unit column:
tibble(mishmash_unit = mishmash) |>
  process_ecotox_units(.names = "{.col}_parsed", warn = FALSE)
As the database contains over 6,000 unique unit codes, it is likely that not all units are processable. Also, because the codes are not always consistent, some of them may not be interpreted correctly. Most frequently occurring units should parse correctly. If you think a specific code is not parsed correctly, and it is not highly outlandish, you could file an issue report. Furthermore, you should always inspect automatically parsed units for correctness.
Another point of attention is the removal of annotations from the unit. Consider the concentration unit with the following annotations:
as_unit_ecotox(c("mg/L CO3", "mg/L CaCO3", "mg/L HCO3"))
Note that they are all interpreted as [mg/L]. Although technically the same unit,
they are definitely not directly compatible. The units package does
not support annotations, so you need to keep track of them yourself.
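One way to keep track of annotations is to store them in a separate column before the units are parsed. The base R sketch below assumes, purely for illustration, that the annotation is whatever follows the first space in the unit string. That assumption holds for the three examples above but not for every unit code in the database (for instance 'fl oz/10 gal').

```r
units_raw <- c("mg/L CO3", "mg/L CaCO3", "mg/L HCO3", "ug/L")

# Assumption for this sketch: text after the first space is the annotation.
has_annotation <- grepl(" ", units_raw, fixed = TRUE)

annotated <- data.frame(
  unit_raw   = units_raw,
  unit_part  = sub(" .*$", "", units_raw),
  annotation = ifelse(has_annotation,
                      sub("^[^ ]+ ", "", units_raw),
                      NA_character_),
  stringsAsFactors = FALSE
)
annotated
```

With the annotation preserved, records can be grouped or filtered by it before values with nominally identical units are compared.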
First let me explain what is meant by 'numerics' in the ECOTOX database. These
are all records that have an accompanying measurement unit in the database. This
includes concentrations, durations and many others. These records are stored
as text fields in the database. So in order to interpret them as actual numerics
in R, they need to be coerced to numerics. You could use a simple call to
as.numeric() to do this, but that will not always work.
The text fields may contain operators such as '<', '>', '~', etc. I think this is a mistake and not by design, because many of the numeric fields have a corresponding operator field where this operator could be stored. Text fields can also contain labelling text (such as an asterisk) or inconsistent decimal or thousand separators.
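A short base R illustration of the problem: as.numeric() returns NA as soon as a field contains an operator, a label or a comma. The cleanup step in this sketch is deliberately naive (it only strips leading operators and trailing asterisks) and is an assumption made for illustration, not the logic used by ECOTOXr.

```r
text_examples <- c("10", " 2", "~5", "9.2*", "2,33")

# Plain coercion: fields with operators, labels or commas become NA.
plain <- suppressWarnings(as.numeric(text_examples))

# Naive cleanup (illustrative assumption): drop leading operators and
# trailing asterisks, then coerce again.
stripped <- gsub("^[<>~=]+|\\*+$", "", trimws(text_examples))
naive    <- suppressWarnings(as.numeric(stripped))

data.frame(text_examples, plain, naive)
```

The value "2,33" remains NA even after the naive cleanup, because deciding whether the comma is a decimal or a thousand separator requires an explicit rule.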
This is why there is as_numeric_ecotox(), which first cleans the text records before
coercing them to numerics:
## Text fields as possibly encountered in the database
text_records <- c("10", " 2", "3 ", "~5", "9.2*", "2,33",
                  "2,333", "2.1(1.0 - 3.2)", "1-5", "1e-3")

as_numeric_ecotox(text_records)
You can use process_ecotox_numerics() to process a data.frame/tibble resulting
from a search query. It automatically applies as_numeric_ecotox() to columns containing
numeric information:
text_tbl <- tibble(conc1_mean = text_records)

process_ecotox_numerics(text_tbl, warn = FALSE)
As indicated above, all notations and operators included with numerics are stripped in
the cleaning process. These notations and operators are potentially important for the
interpretation of the values. It may be wise to keep track of them. One way to do this
is by first trying to coerce texts to numerics with as.numeric() and then with
as_numeric_ecotox(). Cases where the former returns NA but the latter returns a
value are likely to contain notations or operators (or are just inconsistently formatted).
You could also use the .names argument in process_ecotox_numerics() to rename
the numeric columns and keep the original text fields.
process_ecotox_numerics(text_tbl, warn = FALSE, .names = "{.col}_num") |>
  mutate(
    test = is.na(as.numeric(conc1_mean)) &
      !is.na(as_numeric_ecotox(conc1_mean, warn = FALSE))
  )
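Instead of only flagging affected records, the operator or label itself can be captured in its own column. The base R sketch below is one possible approach under simple assumptions (operators appear only at the start of the field, labels are trailing asterisks); the column names are hypothetical.

```r
values <- c("<0.5", "~5", ">100", "12", "9.2*")

# Leading run of comparison-like characters, if present.
operator <- ifelse(grepl("^[<>~=]+", values),
                   sub("^([<>~=]+).*$", "\\1", values),
                   NA_character_)

# Trailing asterisk label, if present.
label <- ifelse(grepl("\\*+$", values), "*", NA_character_)

tracked <- data.frame(raw = values, operator = operator, label = label,
                      stringsAsFactors = FALSE)
tracked
```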
The steps above show how to sanitise numerics and units separately. In order
to standardise numeric values to a specific unit, these steps need to be combined.
This can be achieved by calling process_ecotox_numerics() with add_units set to
TRUE. This will add the corresponding units to the numeric value. But they are
still mixed units. By calling mixed_to_single_unit() you can convert the values
to a specific unit.
tibble(
  conc1_mean = c("1", "2", "5e-4", "0.2"),
  conc1_unit = c("mg/L", "M", "% w/v", "ppt w/v")
) |>
  process_ecotox_numerics(add_units = TRUE) |>
  mutate(
    conc1_mean_standard = mixed_to_single_unit(conc1_mean, "ug/L")
  )
Note that in the example above not all units can be converted to the
target unit "ug/L". This is because concentrations in 'mol/L' require
the molar mass of the solute in order to be converted. Such values are returned as NA.
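If the solute is known, the molar concentration can be converted by hand. The arithmetic below is a sketch that assumes, purely for illustration, that the solute is sodium chloride (molar mass about 58.44 g/mol); the example data above do not specify a substance.

```r
# Illustrative assumption: the solute is sodium chloride (NaCl).
molar_mass_g_per_mol <- 58.44

conc_mol_per_L <- 2                                      # the "2 M" record
conc_g_per_L   <- conc_mol_per_L * molar_mass_g_per_mol  # mol/L * g/mol = g/L
conc_ug_per_L  <- conc_g_per_L * 1e6                     # 1 g = 1e6 ug

conc_ug_per_L
```

In practice the molar mass would have to be looked up per test chemical, which is why this conversion cannot be done from the unit alone.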
The ECOTOX database contains several date fields. They can represent meta-information
about the record (date created and modified), administrative information
(publication date), or experimental information (application date). These
dates are stored as text in the database. As not all records are consistent
or complete, some cleaning is required before coercing the text to a Date
object (?Date).
The example below shows some typical date formats as encountered in the
database and how to coerce them to Date objects using as_date_ecotox():
char_date <- c("5-19-1987   ", "5/dd/2021", "3/19/yyyy", "1985", "mm/19/1999",
               "October 2004", "nr/nr/2015")

as_date_ecotox(char_date)
The only date that cannot be coerced is the one with an unspecified year. It
is returned as NA.
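For comparison, the base R sketch below shows what is needed to handle such records by hand. Substituting the unspecified day with 1 is an illustrative choice made for this sketch and is not necessarily what as_date_ecotox() does internally.

```r
# A complete date parses directly once the format is given.
full <- as.Date(trimws("5-19-1987   "), format = "%m-%d-%Y")

# Unspecified day: substitute a placeholder day (illustrative assumption).
partial      <- "5/dd/2021"
filled       <- sub("dd", "1", partial, fixed = TRUE)
partial_date <- as.Date(filled, format = "%m/%d/%Y")

# Unspecified year: there is no sensible placeholder, so the result is NA.
no_year <- as.Date("3/19/yyyy", format = "%m/%d/%Y")

c(full, partial_date, no_year)
```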
You can use process_ecotox_dates() to process a data.frame/tibble as returned
by a search query. Date columns are automatically coerced with as_date_ecotox().
Column names ending with _date are recognised as date records.
tibble( my_date = char_date ) |> process_ecotox_dates()