raise_incomplete_dataset_to_total_dataset: Raise incomplete dataset to total dataset
In ptaconet/rtunaatlas: IRD Sardara Tuna Atlas R facilities

View source: R/raise_incomplete_dataset_to_total_dataset.R

Description

Given two datasets, one in which the information is usually stratified in space and/or time but whose measure represents only part of reality (the "incomplete" dataset), and one in which the information is usually less stratified (i.e. more aggregated) but whose measure represents reality (the "total" dataset), this function raises the "incomplete" dataset to the "total" dataset using raising factors. For a given stratum, the raising factor is the ratio of the data in the "total" dataset to the data in the "incomplete" dataset, so its inverse is the proportion of the "total" data that is available in the "incomplete" dataset (see section "Details" and function raise_get_rf for additional information).

Usage

raise_incomplete_dataset_to_total_dataset(df_input_incomplete, df_input_total,
  df_rf, x_raising_dimensions, decrease_when_rf_inferior_to_one = TRUE,
  threshold_rf = NULL)

Arguments

`df_input_incomplete`	data.frame "incomplete", to raise. Must have a set of dimensions (columns) + a value column
`df_input_total`	data.frame "total". Must have a set of dimensions (columns) + a value column
`df_rf`	data.frame of raising factors. Output of the function raise_get_rf
`x_raising_dimensions`	vector of dimensions (i.e. dimensions that compose the stratum) to use for the computation of the raising factors. The dimensions must be available in both input data.frames.
`decrease_when_rf_inferior_to_one`	boolean. If the raising factor is less than 1 (i.e. the data in `df_input_incomplete` exceed the data in `df_input_total`), should the incomplete data be decreased? TRUE = decrease. FALSE = do not decrease. Default is TRUE.
`threshold_rf`	numeric from 0 to 100, or NULL. If the raising factor for a stratum is not above this threshold, the data will be removed from the dataset (see section "Details"). Default is NULL, in which case no data are filtered.
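
The effect of decrease_when_rf_inferior_to_one can be pictured with the minimal sketch below. It is not the package's implementation: the helper apply_rf_sketch and the numbers are hypothetical, and the sketch assumes that "do not decrease" simply means leaving the incomplete values unchanged whenever the raising factor is below 1.

```r
# Hypothetical helper (not part of rtunaatlas): apply a raising factor,
# optionally refusing to decrease values when the factor is below 1.
apply_rf_sketch <- function(value, rf, decrease_when_rf_inferior_to_one = TRUE) {
  if (!decrease_when_rf_inferior_to_one) {
    # Assumption: a factor below 1 is treated as 1, so values stay unchanged
    rf <- pmax(rf, 1)
  }
  value * rf
}

value <- c(10, 5)        # hypothetical incomplete values (sum = 15)
rf <- 12 / sum(value)    # hypothetical total of 12, so rf = 0.8

apply_rf_sketch(value, rf, decrease_when_rf_inferior_to_one = TRUE)   # 8 4
apply_rf_sketch(value, rf, decrease_when_rf_inferior_to_one = FALSE)  # 10 5
```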

Details

The concept of raising can be understood with the following example.

Let us take a first dataset df_input_incomplete representing the catches for the stratum defined by Flag=AUS, Year=1992, Gear=LL, Species=YFT, stratified by 5° quadrants and 1 month time resolution (a typical catch-and-effort dataset):

flag	time_start	time_end	geographic_identifier	gear	species	value
AUS	1992-02-01	1992-03-01	235140	LL	YFT	0.05
AUS	1992-06-01	1992-07-01	230125	LL	YFT	0.42
AUS	1992-07-01	1992-08-01	235140	LL	YFT	1.05
AUS	1992-07-01	1992-08-01	240140	LL	YFT	0.15
AUS	1992-08-01	1992-09-01	240140	LL	YFT	0.61
AUS	1992-11-01	1992-12-01	230120	LL	YFT	0.15
AUS	1992-11-01	1992-12-01	230125	LL	YFT	0.15
AUS	1992-11-01	1992-12-01	235115	LL	YFT	1.65
AUS	1992-11-01	1992-12-01	235135	LL	YFT	5
AUS	1992-12-01	1993-01-01	235110	LL	YFT	0.08
AUS	1992-12-01	1993-01-01	240140	LL	YFT	0.12

The second dataset df_input_total represents the catches for the same stratum (Flag=AUS, Year=1992, Gear=LL, Species=YFT). However, the data are more aggregated: they are given for the whole tuna RFMO area of competence at a 1 year time resolution (a typical nominal catch dataset):

flag	time_start	time_end	geographic_identifier	gear	species	value
AUS	1992-02-01	1993-01-01	IOTC	LL	YFT	14

Both datasets describe the same catches. However, in the dataset df_input_incomplete only a sample of the catches has been reported, whereas the catches reported in df_input_total are the real catches that happened in this stratum.

Raising the dataset df_input_incomplete to the dataset df_input_total with x_raising_dimensions = c("flag","year","gear","species") means:

1. Calculating S2: summing the data from df_input_incomplete by year (i.e. sum of the months) for each stratum (i.e. combination of flag, year, gear, species). In the example, S2 = 9.43. At this stage, we get the following table:

flag	time_start	time_end	geographic_identifier	gear	species	value	sum_incomplete_catch
AUS	1992-02-01	1992-03-01	235140	LL	YFT	0.05	9.43
AUS	1992-06-01	1992-07-01	230125	LL	YFT	0.42	9.43
AUS	1992-07-01	1992-08-01	235140	LL	YFT	1.05	9.43
AUS	1992-07-01	1992-08-01	240140	LL	YFT	0.15	9.43
AUS	1992-08-01	1992-09-01	240140	LL	YFT	0.61	9.43
AUS	1992-11-01	1992-12-01	230120	LL	YFT	0.15	9.43
AUS	1992-11-01	1992-12-01	230125	LL	YFT	0.15	9.43
AUS	1992-11-01	1992-12-01	235115	LL	YFT	1.65	9.43
AUS	1992-11-01	1992-12-01	235135	LL	YFT	5	9.43
AUS	1992-12-01	1993-01-01	235110	LL	YFT	0.08	9.43
AUS	1992-12-01	1993-01-01	240140	LL	YFT	0.12	9.43
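
Step 1 can be reproduced in base R with the sketch below. It is only an illustration: the source does not say how the year dimension is obtained, so deriving it from time_start is an assumption, and only the columns needed for the sum are kept.

```r
# Example "incomplete" dataset (values copied from the table above;
# only the columns needed for the stratum sum are kept)
df_input_incomplete <- data.frame(
  flag = "AUS",
  time_start = as.Date(c(
    "1992-02-01", "1992-06-01", "1992-07-01", "1992-07-01", "1992-08-01",
    "1992-11-01", "1992-11-01", "1992-11-01", "1992-11-01", "1992-12-01",
    "1992-12-01"
  )),
  gear = "LL",
  species = "YFT",
  value = c(0.05, 0.42, 1.05, 0.15, 0.61, 0.15, 0.15, 1.65, 5, 0.08, 0.12),
  stringsAsFactors = FALSE
)

# Assumption: the "year" dimension is derived from time_start
df_input_incomplete$year <- as.integer(format(df_input_incomplete$time_start, "%Y"))

# S2: sum of the values per stratum, repeated on every row of the stratum
df_input_incomplete$sum_incomplete_catch <- ave(
  df_input_incomplete$value,
  df_input_incomplete$flag, df_input_incomplete$year,
  df_input_incomplete$gear, df_input_incomplete$species,
  FUN = sum
)

unique(round(df_input_incomplete$sum_incomplete_catch, 2))  # 9.43
```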

2. Calculating S1: summing the data from df_input_total by year for each stratum (i.e. combination of flag, year, gear, species). In the example, S1 = 14.

3. Calculating the raising factor RF for each stratum: RF = S1/S2. This is done with the function raise_get_rf. The inverse of RF is the proportion of the total catch that is available in the catch-and-effort dataset. In the example, RF = 14/9.43 = 1.48. This means that (1 / RF * 100 =) 67 per cent of the total catch is available in the catch-and-effort dataset. At this stage, we get the following table:

flag	time_start	time_end	geographic_identifier	gear	species	value	sum_incomplete_catch	sum_total_catch	RF
AUS	1992-02-01	1992-03-01	235140	LL	YFT	0.05	9.43	14	1.48
AUS	1992-06-01	1992-07-01	230125	LL	YFT	0.42	9.43	14	1.48
AUS	1992-07-01	1992-08-01	235140	LL	YFT	1.05	9.43	14	1.48
AUS	1992-07-01	1992-08-01	240140	LL	YFT	0.15	9.43	14	1.48
AUS	1992-08-01	1992-09-01	240140	LL	YFT	0.61	9.43	14	1.48
AUS	1992-11-01	1992-12-01	230120	LL	YFT	0.15	9.43	14	1.48
AUS	1992-11-01	1992-12-01	230125	LL	YFT	0.15	9.43	14	1.48
AUS	1992-11-01	1992-12-01	235115	LL	YFT	1.65	9.43	14	1.48
AUS	1992-11-01	1992-12-01	235135	LL	YFT	5	9.43	14	1.48
AUS	1992-12-01	1993-01-01	235110	LL	YFT	0.08	9.43	14	1.48
AUS	1992-12-01	1993-01-01	240140	LL	YFT	0.12	9.43	14	1.48
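
The arithmetic of steps 2 and 3 can be sketched as follows. This is not the code of raise_get_rf, and the object and column names (sum_incomplete, sum_total, rf_sketch, rf, perc_available) are assumptions made for the illustration; only the sums 9.43 and 14 come from the example.

```r
x_raising_dimensions <- c("flag", "year", "gear", "species")

# Per-stratum sums S2 and S1 from the example
sum_incomplete <- data.frame(
  flag = "AUS", year = 1992L, gear = "LL", species = "YFT",
  sum_incomplete_catch = 9.43, stringsAsFactors = FALSE
)
sum_total <- data.frame(
  flag = "AUS", year = 1992L, gear = "LL", species = "YFT",
  sum_total_catch = 14, stringsAsFactors = FALSE
)

# Join the two sums on the stratum, then compute RF = S1 / S2
rf_sketch <- merge(sum_incomplete, sum_total, by = x_raising_dimensions)
rf_sketch$rf <- rf_sketch$sum_total_catch / rf_sketch$sum_incomplete_catch

# Share of the total catch available in the incomplete dataset, in per cent
rf_sketch$perc_available <- 100 / rf_sketch$rf

round(rf_sketch$rf, 2)              # 1.48
round(rf_sketch$perc_available, 1)  # 67.4
```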

4. Raising the df_input_incomplete dataset: multiplying each value of the catch-and-effort dataset by the associated RF.

After the raising, the new catch-and-effort dataset is the following (values computed with RF rounded to 1.48):

flag	time_start	time_end	geographic_identifier	gear	species	value
AUS	1992-02-01	1992-03-01	235140	LL	YFT	0.074
AUS	1992-06-01	1992-07-01	230125	LL	YFT	0.6216
AUS	1992-07-01	1992-08-01	235140	LL	YFT	1.554
AUS	1992-07-01	1992-08-01	240140	LL	YFT	0.222
AUS	1992-08-01	1992-09-01	240140	LL	YFT	0.9028
AUS	1992-11-01	1992-12-01	230120	LL	YFT	0.222
AUS	1992-11-01	1992-12-01	230125	LL	YFT	0.222
AUS	1992-11-01	1992-12-01	235115	LL	YFT	2.442
AUS	1992-11-01	1992-12-01	235135	LL	YFT	7.4
AUS	1992-12-01	1993-01-01	235110	LL	YFT	0.1184
AUS	1992-12-01	1993-01-01	240140	LL	YFT	0.1776
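
Step 4 is a plain multiplication, as the sketch below illustrates with the example values. It also shows why the raised values in the table sum to slightly less than 14: the table was apparently computed with RF rounded to 1.48, whereas the unrounded factor reproduces the total. The variable names are chosen for the illustration only.

```r
# Values of the example "incomplete" dataset and the matching total
value <- c(0.05, 0.42, 1.05, 0.15, 0.61, 0.15, 0.15, 1.65, 5, 0.08, 0.12)
sum_total_catch <- 14

# Unrounded raising factor: the raised values sum back to the total
rf <- sum_total_catch / sum(value)
raised_value <- value * rf
isTRUE(all.equal(sum(raised_value), sum_total_catch))  # TRUE

# Raising with the rounded factor used in the table
raised_rounded <- round(value * 1.48, 4)
raised_rounded[1]    # 0.074
sum(raised_rounded)  # about 13.96, not exactly 14
```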

The parameter threshold_rf makes it possible to remove data based on the raising factor. In the example, setting threshold_rf = 10 would remove all the data for which RF > 10; in other words, any data for which less than 10 per cent of the total catches are available in the catch-and-effort dataset for the stratum set with x_raising_dimensions. Setting it to NULL means that no data are filtered.
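
One possible reading of this filter is sketched below. It is an assumption, not the package's code: the threshold is compared with the percentage of the total that is available in the incomplete data (100 / RF), strata that are not above the threshold are dropped, and the three strata and their raising factors are hypothetical.

```r
# Hypothetical raising factors for three strata (illustrative values only)
rf_table <- data.frame(
  stratum = c("A", "B", "C"),
  rf = c(1.48, 8, 25),
  stringsAsFactors = FALSE
)

threshold_rf <- 10  # set to NULL to keep every stratum

# Percentage of the total data available in the incomplete data
rf_table$perc_available <- 100 / rf_table$rf

if (is.null(threshold_rf)) {
  rf_kept <- rf_table
} else {
  # Assumption: keep only the strata strictly above the threshold
  rf_kept <- rf_table[rf_table$perc_available > threshold_rf, ]
}

rf_kept$stratum  # "A" "B" (stratum C has only 4 per cent available)
```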

The object "stats" of the output list is a data.frame with the following columns. In the list below, "sum" is the sum of the data, i.e. the sum of the column 'value' of the dataset:

sum_df_total: sum extracted from df_input_total
sum_df_incomplete_before_raising: sum extracted from df_input_incomplete before the raising
sum_df_incomplete_after_raising: sum extracted from df_input_incomplete after the raising
sum_df_incomplete_do_not_exist_in_df_total: sum extracted from df_input_incomplete that cannot be raised because the strata exist in df_input_incomplete but do not exist in df_input_total. The raised value for these strata is equal to the raw value.
sum_df_total_do_not_exist_in_df_incomplete: sum extracted from df_input_total for the strata that exist in df_input_total but do not exist in df_input_incomplete
perc_df_incomplete_over_df_total_before_raising: percentage of the data of df_input_total that were available in df_input_incomplete before the raising
perc_df_incomplete_over_df_total_after_raising: percentage of the data of df_input_total that are available in df_input_incomplete after the raising
perc_df_incomplete_do_not_exist_in_df_total: percentage of the data coming from df_input_incomplete for which there is no correspondence in df_input_total (i.e. percentage of the data in strata that exist in df_input_incomplete but do not exist in df_input_total)
perc_df_total_do_not_exist_in_df_incomplete: percentage of the data coming from df_input_total for which there is no correspondence in df_input_incomplete (i.e. percentage of the data in strata that exist in df_input_total but do not exist in df_input_incomplete)
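
To make these definitions concrete, the sketch below computes some of the statistics on hypothetical per-stratum sums. It is not the function's implementation: the data are invented, and the denominators used for the percentages are one reading of the definitions above.

```r
# Hypothetical per-stratum sums (illustrative values, not real data):
# stratum B exists only in the incomplete data, stratum C only in the total
incomplete <- data.frame(stratum = c("A", "B"), value = c(9.43, 2),
                         stringsAsFactors = FALSE)
total <- data.frame(stratum = c("A", "C"), value = c(14, 6),
                    stringsAsFactors = FALSE)

sum_df_total <- sum(total$value)                           # 20
sum_df_incomplete_before_raising <- sum(incomplete$value)  # 11.43

# Sums over the strata that have no counterpart in the other dataset
sum_df_incomplete_do_not_exist_in_df_total <-
  sum(incomplete$value[!incomplete$stratum %in% total$stratum])  # 2
sum_df_total_do_not_exist_in_df_incomplete <-
  sum(total$value[!total$stratum %in% incomplete$stratum])       # 6

# Percentages (assumed denominators: the sum of the dataset the data come from)
perc_df_incomplete_over_df_total_before_raising <-
  100 * sum_df_incomplete_before_raising / sum_df_total
perc_df_incomplete_do_not_exist_in_df_total <-
  100 * sum_df_incomplete_do_not_exist_in_df_total / sum_df_incomplete_before_raising
perc_df_total_do_not_exist_in_df_incomplete <-
  100 * sum_df_total_do_not_exist_in_df_incomplete / sum_df_total  # 30
```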

Usually, for catches: x_raising_dimensions = c("gear","flag","species","year","source_authority","unit") and for efforts: c("gear","flag","year","source_authority","unit").

Value

a list with two objects:

"df": data.frame. Representing df_input_incomplete raised to df_input_total
"stats": data.frame. Information regarding the raising process (see Details for additional information).

Examples


# Load the package documented here
library(rtunaatlas)

# Connect to the Tuna Atlas database
con <- db_connection_tunaatlas_world()

# Extract the IOTC georeferenced catch time series from the Sardara DB
ind_catch_tunaatlasird_level1 <- extract_dataset(
  con,
  list_metadata_datasets(con, identifier = "indian_ocean_catch_1952_11_01_2016_01_01_tunaatlasIRD_level1")
)
head(ind_catch_tunaatlasird_level1)

# Extract the IOTC total (nominal) catch time series from the Sardara DB
ind_nominal_catch_tunaatlasiotc_level0 <- extract_dataset(
  con,
  list_metadata_datasets(con, identifier = "indian_ocean_nominal_catch_1950_01_01_2015_01_01_tunaatlasIOTC_2017_level0")
)
head(ind_nominal_catch_tunaatlasiotc_level0)

## Raise georeferenced catch to total catch.
## Raise by {gear, flag, species, year, source_authority, unit}

# First calculate the dataset of raising factors
df_rf <- raise_get_rf(
  df_input_incomplete = ind_catch_tunaatlasird_level1,
  df_input_total = ind_nominal_catch_tunaatlasiotc_level0,
  x_raising_dimensions = c("gear", "flag", "species", "year", "source_authority", "unit")
)

# Then raise
ind_catch_tunaatlasird_level2 <- raise_incomplete_dataset_to_total_dataset(
  df_input_incomplete = ind_catch_tunaatlasird_level1,
  df_input_total = ind_nominal_catch_tunaatlasiotc_level0,
  df_rf = df_rf,
  x_raising_dimensions = c("gear", "flag", "species", "year", "source_authority", "unit"),
  threshold_rf = NULL
)

# Raised dataset
head(ind_catch_tunaatlasird_level2$df)

# Get statistics on the raising process
ind_catch_tunaatlasird_level2$stats

# Close the database connection
dbDisconnect(con)

ptaconet/rtunaatlas documentation built on June 23, 2024, 9:35 p.m.