prepsources: Filter and aggregate the raw source dataset

View source: R/prepsources.R

prepsources R Documentation

Filter and aggregate the raw source dataset

Description

This function prepares the available dataset (e.g. GNIPDataDE) for use in creating the isoscape. It allows the data to be trimmed by month, year and location, and the selected data to be aggregated per location, per location:month combination or per location:year combination. The function can also be used to randomly exclude some observations.

Usage

prepsources(
  data,
  month = 1:12,
  year,
  long_min = -180,
  long_max = 180,
  lat_min = -90,
  lat_max = 90,
  split_by = NULL,
  prop_random = 0,
  random_level = "source",
  col_source_value = "source_value",
  col_source_ID = "source_ID",
  col_lat = "lat",
  col_long = "long",
  col_elev = "elev",
  col_month = "month",
  col_year = "year"
)

Arguments

data

A dataframe containing raw isotopic measurements of sources

month

A numeric vector indicating the months to select from. Should be a vector of whole numbers between 1 and 12. The default is 1:12, which selects all months.

year

A numeric vector indicating the years to select from. Should be a vector of whole numbers. The default is to select all years available.

long_min

A numeric indicating the minimum longitude to select from. Should be a number between -180 and 180 (default = -180).

long_max

A numeric indicating the maximum longitude to select from. Should be a number between -180 and 180 (default = 180).

lat_min

A numeric indicating the minimum latitude to select from. Should be a number between -90 and 90 (default = -90).

lat_max

A numeric indicating the maximum latitude to select from. Should be a number between -90 and 90 (default = 90).

split_by

A string indicating whether data should be aggregated per location (split_by = NULL, the default), per location:month combination (split_by = "month"), or per location:year combination (split_by = "year").

prop_random

A numeric indicating the proportion of observations or sampling locations (depending on the argument for random_level) that will be kept. If prop_random is greater than 0, then the function will return a list containing two dataframes: one containing the selected data, called selected_data, and one containing the remaining data, called remaining_data.

random_level

A string indicating the level at which random draws are performed. The two possibilities are "obs", which indicates that observations are randomly drawn independently of their location, or "source" (default), which indicates that observations are randomly drawn at the level of sampling locations.

col_source_value

A string indicating the column containing the isotopic measurements

col_source_ID

A string indicating the column containing the ID of each sampling location

col_lat

A string indicating the column containing the latitude of each sampling location

col_long

A string indicating the column containing the longitude of each sampling location

col_elev

A string indicating the column containing the elevation of each sampling location

col_month

A string indicating the column containing the month of sampling

col_year

A string indicating the column containing the year of sampling

Details

This function aggregates the data as required for the IsoriX workflow. Three aggregation schemes are currently possible. The simplest one, used by default, aggregates the data so as to obtain a single row per sampling location. Datasets prepared in this way can be fitted directly with the function isofit to build an isoscape. It is also possible to aggregate the data differently, either to build sub-isoscapes representing temporal variation in isotope composition or to produce isoscapes weighted by the amount of precipitation (for isoscapes built on precipitation data only). The two options are to split the data from each location by month or by year, as set with the split_by argument. Datasets prepared in this way should be fitted with the function isomultifit.
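To make the aggregation schemes concrete, here is a minimal base R sketch of the idea on made-up data. It is not the code used by prepsources: the helper summarise_groups and the output column names mean_value, var_value and n_value are hypothetical, chosen only for illustration.

```r
# Toy data; column names mirror the defaults shown in Usage, values are made up
toy <- data.frame(
  source_ID    = c("A", "A", "A", "B", "B"),
  month        = c(1, 1, 2, 1, 2),
  source_value = c(-60, -64, -50, -80, -70)
)

# Hypothetical helper: one row per group with mean, variance and count
summarise_groups <- function(d, by) {
  groups <- d[by]  # grouping columns, kept as a data.frame (i.e. a list)
  out <- aggregate(d["source_value"], by = groups, FUN = mean)
  names(out)[names(out) == "source_value"] <- "mean_value"
  out$var_value <- aggregate(d["source_value"], by = groups, FUN = var)$source_value
  out$n_value <- aggregate(d["source_value"], by = groups, FUN = length)$source_value
  out
}

# Analogue of split_by = NULL: a single row per sampling location
per_location <- summarise_groups(toy, "source_ID")

# Analogue of split_by = "month": one row per location:month combination
# (groups holding a single measurement get an NA variance)
per_location_month <- summarise_groups(toy, c("source_ID", "month"))

print(per_location)
print(per_location_month)
```

Splitting by year would follow the same pattern with a year column in place of month.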

The function also allows the user to filter the sampling locations by time (years and/or months) and space (geographic coordinates, i.e. longitude and latitude). This makes it possible to compute tailored isoscapes that match, e.g., the time of sampling, and to speed up the model fit by cropping the data to a given area. The dataframe produced by this function can be used as input to fit the isoscape (see isofit and isomultifit).
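The filtering by time and space can be pictured with the following base R sketch on made-up data. It only illustrates the kind of selection controlled by month, year, long_min, long_max, lat_min and lat_max; treating the bounds as inclusive is an assumption here, not something stated in this documentation.

```r
# Toy data; column names mirror the defaults shown in Usage, values are made up
toy <- data.frame(
  source_ID    = c("A", "A", "B", "C"),
  lat          = c(52.5, 52.5, 48.1, 40.4),
  long         = c(13.4, 13.4, 11.6, -3.7),
  month        = c(6, 12, 7, 6),
  year         = c(1995, 1995, 1996, 1995),
  source_value = c(-55, -80, -60, -40)
)

# Keep warm months of 1995-1996 inside a rough bounding box
# (bounds assumed inclusive for this sketch)
keep <- toy$month %in% 5:8 &
  toy$year %in% 1995:1996 &
  toy$long >= 5 & toy$long <= 16 &
  toy$lat >= 47 & toy$lat <= 55

filtered <- toy[keep, ]
print(filtered)
```

Here the December measurement at location A and the location outside the bounding box (C) are dropped, leaving one row each for A and B.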

Value

This function returns a dataframe containing the filtered data aggregated by sampling location or, if prop_random is greater than 0, a list of two dataframes called selected_data and remaining_data (see the argument prop_random above). For each sampling location, the sample mean and sample variance are computed.
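The difference between the two values of random_level can be sketched in base R as follows. This illustrates the concept only, on made-up data: how prepsources rounds the number of draws is not documented here, so the use of round() is an assumption, and the aggregation step is left out.

```r
# Toy data: four sampling locations with two measurements each (values made up)
toy <- data.frame(
  source_ID    = rep(c("A", "B", "C", "D"), each = 2),
  source_value = c(-60, -62, -80, -78, -55, -57, -70, -71)
)

set.seed(1)   # make the random draws reproducible
prop <- 0.5   # plays the role of prop_random

# Analogue of random_level = "obs": draw rows regardless of their location
n <- nrow(toy)
picked_rows <- sample(n, size = round(prop * n))
by_obs <- list(
  selected_data  = toy[picked_rows, ],
  remaining_data = toy[-picked_rows, ]
)

# Analogue of random_level = "source": draw whole sampling locations
ids <- unique(toy$source_ID)
picked_ids <- sample(ids, size = round(prop * length(ids)))
in_pick <- toy$source_ID %in% picked_ids
by_source <- list(
  selected_data  = toy[in_pick, ],
  remaining_data = toy[!in_pick, ]
)

lapply(by_source, head)
```

With draws at the source level, no sampling location appears in both elements of the list, whereas draws at the observation level can place measurements from the same location on both sides.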

Examples

## Create a processed dataset for Germany
GNIPDataDEagg <- prepsources(data = GNIPDataDE)

head(GNIPDataDEagg)

## Create a processed dataset for Germany per month
GNIPDataDEmonthly <- prepsources(
  data = GNIPDataDE,
  split_by = "month"
)

head(GNIPDataDEmonthly)

## Create a processed dataset for Germany per year
GNIPDataDEyearly <- prepsources(
  data = GNIPDataDE,
  split_by = "year"
)

head(GNIPDataDEyearly)

## Create an isoscape dataset for warm months in Germany in 1995 and 1996
GNIPDataDEwarm <- prepsources(
  data = GNIPDataDE,
  month = 5:8,
  year = 1995:1996
)

head(GNIPDataDEwarm)


## Create a dataset with 90% of obs
GNIPDataDE90pct <- prepsources(
  data = GNIPDataDE,
  prop_random = 0.9,
  random_level = "obs"
)

lapply(GNIPDataDE90pct, head) # show beginning of both datasets

## Create a dataset with half the weather sources
GNIPDataDE50pctsources <- prepsources(
  data = GNIPDataDE,
  prop_random = 0.5,
  random_level = "source"
)

lapply(GNIPDataDE50pctsources, head)


## Create a dataset with half the weather sources split per month
GNIPDataDE50pctsourcesMonthly <- prepsources(
  data = GNIPDataDE,
  split_by = "month",
  prop_random = 0.5,
  random_level = "source"
)

lapply(GNIPDataDE50pctsourcesMonthly, head)


IsoriX documentation built on Nov. 14, 2023, 5:09 p.m.