raise_incomplete_dataset_to_total_dataset: Raise incomplete dataset to total dataset

View source: R/raise_incomplete_dataset_to_total_dataset.R

raise_incomplete_dataset_to_total_dataset R Documentation

Raise incomplete dataset to total dataset

Description

Given two datasets, one in which the information is usually stratified in space and/or time but whose measure represents only part of the reality (the "incomplete" dataset), and one in which the information is usually less stratified (i.e. more aggregated) but whose measure represents the reality (the "total" dataset), this function raises the "incomplete" dataset to the "total" dataset using raising factors. Raising factors are computed from the proportion of the "total" dataset that is available in the "incomplete" dataset (see section "Details" and function raise_get_rf for additional information).

Usage

raise_incomplete_dataset_to_total_dataset(df_input_incomplete, df_input_total,
  df_rf, x_raising_dimensions, decrease_when_rf_inferior_to_one = TRUE,
  threshold_rf = NULL)

Arguments

df_input_incomplete

data.frame "incomplete", to raise. Must have a set of dimensions (columns) + a value column

df_input_total

data.frame "total". Must have a set of dimensions (columns) + a value column

df_rf

data.frame of raising factors. Output of the function raise_get_rf

x_raising_dimensions

vector of dimensions (i.e. dimensions that compose the stratum) to use for the computation of the raising factors. The dimensions must be available in both input data.frames.

decrease_when_rf_inferior_to_one

boolean. If the raising factor is less than 1 (i.e. data in df_input_incomplete is greater than data in df_input_total), should the incomplete data be decreased? TRUE = decrease. FALSE = do not decrease. Default is TRUE.

threshold_rf

numeric from 0 to 100. If the raising factor for a stratum is not above this threshold, the data will be removed from the dataset. Default is NULL (no filtering).
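The effect of decrease_when_rf_inferior_to_one can be illustrated with a minimal sketch. The helper apply_rf below is hypothetical and is not part of rtunaatlas; it assumes that "do not decrease" means leaving the value unchanged, which is equivalent to capping the raising factor at 1.

```r
# Hypothetical helper (not the package implementation): apply a raising
# factor to a vector of values, optionally refusing to decrease them.
apply_rf <- function(value, rf, decrease_when_rf_inferior_to_one = TRUE) {
  if (!decrease_when_rf_inferior_to_one) {
    # Assumption: a raising factor below 1 is treated as 1 (value unchanged)
    rf <- pmax(rf, 1)
  }
  value * rf
}

apply_rf(c(10, 10), c(0.5, 2), decrease_when_rf_inferior_to_one = TRUE)   # 5 20
apply_rf(c(10, 10), c(0.5, 2), decrease_when_rf_inferior_to_one = FALSE)  # 10 20
```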

Details

It is possible to understand the concept of raising with the following example.

Let us take a first dataset df_input_incomplete representing the catches for the stratum defined by Flag=AUS, Year=1992, Gear=LL, Species=YFT, stratified by 5° quadrants and a 1-month time resolution (a typical catch-and-effort dataset):

flag time_start time_end geographic_identifier gear species value
AUS 1992-02-01 1992-03-01 235140 LL YFT 0.05
AUS 1992-06-01 1992-07-01 230125 LL YFT 0.42
AUS 1992-07-01 1992-08-01 235140 LL YFT 1.05
AUS 1992-07-01 1992-08-01 240140 LL YFT 0.15
AUS 1992-08-01 1992-09-01 240140 LL YFT 0.61
AUS 1992-11-01 1992-12-01 230120 LL YFT 0.15
AUS 1992-11-01 1992-12-01 230125 LL YFT 0.15
AUS 1992-11-01 1992-12-01 235115 LL YFT 1.65
AUS 1992-11-01 1992-12-01 235135 LL YFT 5
AUS 1992-12-01 1993-01-01 235110 LL YFT 0.08
AUS 1992-12-01 1993-01-01 240140 LL YFT 0.12

The second dataset, df_input_total, represents the catches for the same stratum (Flag=AUS, Year=1992, Gear=LL, Species=YFT), but the data are more aggregated: they are reported for the whole tuna RFMO area of competence at a 1-year time resolution (a typical nominal catch dataset):

flag time_start time_end geographic_identifier gear species value
AUS 1992-02-01 1993-01-01 IOTC LL YFT 14

Both datasets describe the same catches; however, in df_input_incomplete only a sample of the catches has been reported, whereas the catches reported in df_input_total are the real catches that happened in this stratum.

Raising the dataset df_input_incomplete to the dataset df_input_total with x_raising_dimensions = c("flag","year","gear","species") means:

  • 1. Calculating S2: Summing the data from df_input_incomplete by year (i.e. sum of the months) for each stratum (i.e. each combination of flag, year, gear, species). In the example, S2 = 9.43. At this stage, we get the following table:

    flag time_start time_end geographic_identifier gear species value sum_incomplete_catch
    AUS 1992-02-01 1992-03-01 235140 LL YFT 0.05 9.43
    AUS 1992-06-01 1992-07-01 230125 LL YFT 0.42 9.43
    AUS 1992-07-01 1992-08-01 235140 LL YFT 1.05 9.43
    AUS 1992-07-01 1992-08-01 240140 LL YFT 0.15 9.43
    AUS 1992-08-01 1992-09-01 240140 LL YFT 0.61 9.43
    AUS 1992-11-01 1992-12-01 230120 LL YFT 0.15 9.43
    AUS 1992-11-01 1992-12-01 230125 LL YFT 0.15 9.43
    AUS 1992-11-01 1992-12-01 235115 LL YFT 1.65 9.43
    AUS 1992-11-01 1992-12-01 235135 LL YFT 5 9.43
    AUS 1992-12-01 1993-01-01 235110 LL YFT 0.08 9.43
    AUS 1992-12-01 1993-01-01 240140 LL YFT 0.12 9.43
  • 2. Calculating S1: Summing the data from df_input_total by year for each stratum (i.e. each combination of flag, year, gear, species). In the example, S1 = 14.

  • 3. Calculating the raising factor RF for each stratum: RF = S1/S2. This is done with the function raise_get_rf. RF is the inverse of the proportion of the total catch that is available in the catch-and-effort. In the example, RF = 14/9.43 = 1.48. This means that (1 / RF * 100 =) 67% of the total catch is available in the catch-and-effort. At this stage, we get the following table:

    flag time_start time_end geographic_identifier gear species value sum_incomplete_catch sum_total_catch RF
    AUS 1992-02-01 1992-03-01 235140 LL YFT 0.05 9.43 14 1.48
    AUS 1992-06-01 1992-07-01 230125 LL YFT 0.42 9.43 14 1.48
    AUS 1992-07-01 1992-08-01 235140 LL YFT 1.05 9.43 14 1.48
    AUS 1992-07-01 1992-08-01 240140 LL YFT 0.15 9.43 14 1.48
    AUS 1992-08-01 1992-09-01 240140 LL YFT 0.61 9.43 14 1.48
    AUS 1992-11-01 1992-12-01 230120 LL YFT 0.15 9.43 14 1.48
    AUS 1992-11-01 1992-12-01 230125 LL YFT 0.15 9.43 14 1.48
    AUS 1992-11-01 1992-12-01 235115 LL YFT 1.65 9.43 14 1.48
    AUS 1992-11-01 1992-12-01 235135 LL YFT 5 9.43 14 1.48
    AUS 1992-12-01 1993-01-01 235110 LL YFT 0.08 9.43 14 1.48
    AUS 1992-12-01 1993-01-01 240140 LL YFT 0.12 9.43 14 1.48
  • 4. Raising the df_input_incomplete dataset: Multiplying each value from the catch-and-effort dataset by the associated RF.

After the raising, the new dataset for catch-and-effort is the following:

flag time_start time_end geographic_identifier gear species value
AUS 1992-02-01 1992-03-01 235140 LL YFT 0.074
AUS 1992-06-01 1992-07-01 230125 LL YFT 0.6216
AUS 1992-07-01 1992-08-01 235140 LL YFT 1.554
AUS 1992-07-01 1992-08-01 240140 LL YFT 0.222
AUS 1992-08-01 1992-09-01 240140 LL YFT 0.9028
AUS 1992-11-01 1992-12-01 230120 LL YFT 0.222
AUS 1992-11-01 1992-12-01 230125 LL YFT 0.222
AUS 1992-11-01 1992-12-01 235115 LL YFT 2.442
AUS 1992-11-01 1992-12-01 235135 LL YFT 7.4
AUS 1992-12-01 1993-01-01 235110 LL YFT 0.1184
AUS 1992-12-01 1993-01-01 240140 LL YFT 0.1776
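The four steps above can be sketched in base R on the example data. This is only an illustration under stated assumptions, not the package implementation: a year column is assumed to have already been derived from time_start, the time columns are omitted, and the intermediate object names (s1, s2, rf, raised, value_raised) are hypothetical.

```r
# Example data from the tables above (time columns replaced by an assumed 'year')
df_incomplete <- data.frame(
  flag = "AUS", year = 1992, gear = "LL", species = "YFT",
  geographic_identifier = c("235140", "230125", "235140", "240140", "240140",
                            "230120", "230125", "235115", "235135", "235110",
                            "240140"),
  value = c(0.05, 0.42, 1.05, 0.15, 0.61, 0.15, 0.15, 1.65, 5, 0.08, 0.12),
  stringsAsFactors = FALSE
)
df_total <- data.frame(flag = "AUS", year = 1992, gear = "LL", species = "YFT",
                       value = 14, stringsAsFactors = FALSE)

dims <- c("flag", "year", "gear", "species")

# Step 1: S2, the sum of the incomplete data for each stratum
s2 <- aggregate(df_incomplete["value"], by = df_incomplete[dims], FUN = sum)
names(s2)[names(s2) == "value"] <- "sum_incomplete_catch"

# Step 2: S1, the sum of the total data for each stratum
s1 <- aggregate(df_total["value"], by = df_total[dims], FUN = sum)
names(s1)[names(s1) == "value"] <- "sum_total_catch"

# Step 3: raising factor RF = S1 / S2 for each stratum
rf <- merge(s1, s2, by = dims)
rf$rf <- rf$sum_total_catch / rf$sum_incomplete_catch

# Step 4: multiply each incomplete value by the RF of its stratum
raised <- merge(df_incomplete, rf[, c(dims, "rf")], by = dims)
raised$value_raised <- raised$value * raised$rf

round(rf$rf, 2)           # 1.48
sum(raised$value_raised)  # 14, up to floating-point error
```

Because the sketch uses the unrounded factor (about 1.4846), the raised values differ slightly from the table above, which was computed with RF rounded to 1.48; the unrounded values sum to the total catch of 14.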

The parameter threshold_rf makes it possible to remove some data based on the raising factor. In the example, setting threshold_rf = 10 would remove all the data for which RF > 10; in other words, any data for which less than 10 per cent of the total catches are available in the catch-and-effort for the stratum set with x_raising_dimensions. Setting it to NULL means that no data are filtered.
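A minimal sketch of this filtering follows, under the reading that the threshold is compared with the percentage of the total catch available in the incomplete dataset (100 / RF). The table rf_table, its strata, and the column perc_available are hypothetical, and whether a stratum sitting exactly on the threshold is kept is an assumption; the package code may differ.

```r
# Hypothetical raising factors for three strata
rf_table <- data.frame(stratum = c("A", "B", "C"),
                       rf = c(1.48, 25, 4),
                       stringsAsFactors = FALSE)

# Percentage of the total catch available in the incomplete dataset
rf_table$perc_available <- 100 / rf_table$rf

threshold_rf <- 10  # NULL would disable the filter

keep <- if (is.null(threshold_rf)) {
  rep(TRUE, nrow(rf_table))
} else {
  # Assumption: strata exactly at the threshold are kept
  rf_table$perc_available >= threshold_rf
}

rf_table$stratum[keep]  # "A" "C": stratum B (4 per cent available) is removed
```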

The object "stats" of the output list is a data.frame whose columns are the following. In the list below, "sum" is the sum of the data, i.e. the sum of the column 'value' of the dataset:

  • sum_df_total: sum extracted from df_input_total

  • sum_df_incomplete_before_raising: sum extracted from df_input_incomplete before the raising

  • sum_df_incomplete_after_raising: sum extracted from df_input_incomplete after the raising

  • sum_df_incomplete_do_not_exist_in_df_total: sum extracted from df_input_incomplete that cannot be raised because the strata exist in df_input_incomplete but not in df_input_total. The raised value for these strata is equal to the raw value.

  • sum_df_total_do_not_exist_in_df_incomplete: sum extracted from df_input_total that does not exist in df_input_incomplete (i.e. sum of the strata that exist in df_input_total but not in df_input_incomplete)

  • perc_df_incomplete_over_df_total_before_raising: percentage of the data of df_input_total that were available in df_input_incomplete before the raising

  • perc_df_incomplete_over_df_total_after_raising: percentage of the data of df_input_total that are available in df_input_incomplete after the raising

  • perc_df_incomplete_do_not_exist_in_df_total: percentage of the data coming from df_input_incomplete for which there is no correspondence in df_input_total (i.e. percentage of the strata that exist in df_input_incomplete but do not exist in df_input_total)

  • perc_df_total_do_not_exist_in_df_incomplete: percentage of the data coming from df_input_total for which there is no correspondence in df_input_incomplete (i.e. percentage of the strata that exist in df_input_total but do not exist in df_input_incomplete)
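Some of these statistics can be reproduced by hand on a toy example. The sketch below is illustrative only: the two small data.frames and the helper stratum_key are hypothetical, and the denominators of the percentages (the sum of the dataset each percentage "comes from") are an assumption about how the package computes them.

```r
dims <- c("flag", "gear")

# Hypothetical inputs: FRA/PS exists only in the incomplete dataset,
# JPN/LL exists only in the total dataset.
df_incomplete <- data.frame(flag = c("AUS", "AUS", "FRA"),
                            gear = c("LL", "LL", "PS"),
                            value = c(2, 3, 4),
                            stringsAsFactors = FALSE)
df_total <- data.frame(flag = c("AUS", "JPN"),
                       gear = c("LL", "LL"),
                       value = c(10, 6),
                       stringsAsFactors = FALSE)

# Hypothetical helper: one string key per row, identifying its stratum
stratum_key <- function(df) {
  do.call(paste, c(unname(as.list(df[dims])), sep = "|"))
}

in_total      <- stratum_key(df_incomplete) %in% stratum_key(df_total)
in_incomplete <- stratum_key(df_total) %in% stratum_key(df_incomplete)

sum_df_total                     <- sum(df_total$value)       # 16
sum_df_incomplete_before_raising <- sum(df_incomplete$value)  # 9
sum_df_incomplete_do_not_exist_in_df_total <-
  sum(df_incomplete$value[!in_total])                         # 4
sum_df_total_do_not_exist_in_df_incomplete <-
  sum(df_total$value[!in_incomplete])                         # 6

perc_df_incomplete_over_df_total_before_raising <-
  100 * sum_df_incomplete_before_raising / sum_df_total       # 56.25
perc_df_incomplete_do_not_exist_in_df_total <-
  100 * sum_df_incomplete_do_not_exist_in_df_total /
  sum_df_incomplete_before_raising                            # about 44.4
perc_df_total_do_not_exist_in_df_incomplete <-
  100 * sum_df_total_do_not_exist_in_df_incomplete / sum_df_total  # 37.5
```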

Usually, for catches: x_raising_dimensions = c("gear","flag","species","year","source_authority","unit") and for efforts: c("gear","flag","year","source_authority","unit").

Value

a list with two objects:

  • "df": data.frame. Representing df_input_incomplete raised to df_input_total

  • "stats": data.frame. Information regarding the raising process (see Details for additional information).

See Also

Other process data: convert_units, create_calendar, create_grid, get_rfmos_datasets_level0, map_codelist, raise_datasets_by_dimension, raise_get_rf, rasterize_geo_timeseries, spatial_curation_downgrade_resolution, spatial_curation_intersect_areas, spatial_curation_reallocate_data, spatial_curation_upgrade_resolution

Examples


# Connect to Tuna atlas database
con<-db_connection_tunaatlas_world()

# Extract IOTC georeferenced catch time series from Sardara DB
ind_catch_tunaatlasird_level1<-extract_dataset(con,list_metadata_datasets(con,identifier="indian_ocean_catch_1952_11_01_2016_01_01_tunaatlasIRD_level1"))
head(ind_catch_tunaatlasird_level1)

# Extract IOTC total (nominal) catch time series from Sardara DB
ind_nominal_catch_tunaatlasiotc_level0<-extract_dataset(con,list_metadata_datasets(con,identifier="indian_ocean_nominal_catch_1950_01_01_2015_01_01_tunaatlasIOTC_2017_level0"))
head(ind_nominal_catch_tunaatlasiotc_level0)

## Raise georeferenced catch to total catch. Raise by {gear, flag, species, year, source_authority, unit}

# First calculate the dataset of raising factor
df_rf<-raise_get_rf(
df_input_incomplete=ind_catch_tunaatlasird_level1,
df_input_total=ind_nominal_catch_tunaatlasiotc_level0,
x_raising_dimensions=c("gear","flag","species","year","source_authority","unit")
)

# Then raise
ind_catch_tunaatlasird_level2<-raise_incomplete_dataset_to_total_dataset(
df_input_incomplete=ind_catch_tunaatlasird_level1,
df_input_total=ind_nominal_catch_tunaatlasiotc_level0,
df_rf=df_rf,
x_raising_dimensions=c("gear","flag","species","year","source_authority","unit"),
threshold_rf=NULL)

head(ind_catch_tunaatlasird_level2$df)

# get statistics on the raising process
ind_catch_tunaatlasird_level2$stats

dbDisconnect(con)


ptaconet/rtunaatlas documentation built on June 23, 2024, 9:35 p.m.