View source: R/raise_incomplete_dataset_to_total_dataset.R
raise_incomplete_dataset_to_total_dataset | R Documentation |
Given two datasets, one in which the information is usually stratified in space and/or time but whose measure represents only part of reality (the "incomplete" dataset), and one in which the information is usually less stratified (i.e. more aggregated) but whose measure represents reality (the "total" dataset), this function raises the "incomplete" dataset to the "total" dataset using raising factors. Raising factors reflect the proportion of the data in the "total" dataset that is available in the "incomplete" dataset (see section "Details" and function raise_get_rf for additional information).
raise_incomplete_dataset_to_total_dataset(df_input_incomplete, df_input_total,
df_rf, x_raising_dimensions, decrease_when_rf_inferior_to_one = TRUE,
threshold_rf = NULL)
df_input_incomplete |
data.frame "incomplete", to raise. Must have a set of dimensions (columns) + a value column |
df_input_total |
data.frame "total". Must have a set of dimensions (columns) + a value column |
df_rf |
data.frame of raising factors. Output of the function raise_get_rf |
x_raising_dimensions |
vector of dimensions (i.e. dimensions that compose the stratum) to use for the computation of the raising factors. The dimensions must be available in both input data.frames. |
decrease_when_rf_inferior_to_one |
boolean. If the raising factor is inferior to 1 (i.e. data in |
threshold_rf |
numeric from 0 to 100. If the raising factor for a stratum is not above this threshold, the data will be removed from the dataset. |
It is possible to understand the concept of raising with the following example.
Let us take a first dataset df_input_incomplete representing the catches for the stratum defined by Flag=AUS, Year=1992, Gear=LL, Species=YFT, stratified by 5° quadrants and 1 month time resolution (typical catch-and-effort dataset):
flag | time_start | time_end | geographic_identifier | gear | species | value |
AUS | 1992-02-01 | 1992-03-01 | 235140 | LL | YFT | 0.05 |
AUS | 1992-06-01 | 1992-07-01 | 230125 | LL | YFT | 0.42 |
AUS | 1992-07-01 | 1992-08-01 | 235140 | LL | YFT | 1.05 |
AUS | 1992-07-01 | 1992-08-01 | 240140 | LL | YFT | 0.15 |
AUS | 1992-08-01 | 1992-09-01 | 240140 | LL | YFT | 0.61 |
AUS | 1992-11-01 | 1992-12-01 | 230120 | LL | YFT | 0.15 |
AUS | 1992-11-01 | 1992-12-01 | 230125 | LL | YFT | 0.15 |
AUS | 1992-11-01 | 1992-12-01 | 235115 | LL | YFT | 1.65 |
AUS | 1992-11-01 | 1992-12-01 | 235135 | LL | YFT | 5 |
AUS | 1992-12-01 | 1993-01-01 | 235110 | LL | YFT | 0.08 |
AUS | 1992-12-01 | 1993-01-01 | 240140 | LL | YFT | 0.12 |
The second dataset df_input_total represents the catches for the same stratum (Flag=AUS, Year=1992, Gear=LL, Species=YFT); however, the data are more aggregated: they are given for the tuna RFMO area of competence at 1 year time resolution (typical nominal catch dataset):
flag | time_start | time_end | geographic_identifier | gear | species | value |
AUS | 1992-02-01 | 1993-01-01 | IOTC | LL | YFT | 14 |
Both datasets describe the same information; however, in the dataset df_input_incomplete, only a sample of the catches has been reported. The catches reported in df_input_total are the real catches that happened in this stratum.
Raising the dataset df_input_incomplete to the dataset df_input_total with x_raising_dimensions = c("flag","year","gear","species") means:
1. Calculating S2: summing the data from df_input_incomplete by year (i.e. sum of the months) for each stratum (i.e. combination of flag, year, gear, species). In the example, S2 = 9.43.
At this stage, we get the following table:
flag | time_start | time_end | geographic_identifier | gear | species | value | sum_incomplete_catch |
AUS | 1992-02-01 | 1992-03-01 | 235140 | LL | YFT | 0.05 | 9.43 |
AUS | 1992-06-01 | 1992-07-01 | 230125 | LL | YFT | 0.42 | 9.43 |
AUS | 1992-07-01 | 1992-08-01 | 235140 | LL | YFT | 1.05 | 9.43 |
AUS | 1992-07-01 | 1992-08-01 | 240140 | LL | YFT | 0.15 | 9.43 |
AUS | 1992-08-01 | 1992-09-01 | 240140 | LL | YFT | 0.61 | 9.43 |
AUS | 1992-11-01 | 1992-12-01 | 230120 | LL | YFT | 0.15 | 9.43 |
AUS | 1992-11-01 | 1992-12-01 | 230125 | LL | YFT | 0.15 | 9.43 |
AUS | 1992-11-01 | 1992-12-01 | 235115 | LL | YFT | 1.65 | 9.43 |
AUS | 1992-11-01 | 1992-12-01 | 235135 | LL | YFT | 5 | 9.43 |
AUS | 1992-12-01 | 1993-01-01 | 235110 | LL | YFT | 0.08 | 9.43 |
AUS | 1992-12-01 | 1993-01-01 | 240140 | LL | YFT | 0.12 | 9.43 |
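As a minimal base-R sketch of step 1 (an illustration, not the package's implementation), the per-stratum sum can be attached to every row with ave(). The data frame below retypes the example table; the name df_incomplete and the way the year column is derived from time_start are assumptions made for this sketch.

```r
# Example catch-and-effort rows (values copied from the table above)
df_incomplete <- data.frame(
  flag = "AUS", gear = "LL", species = "YFT",
  time_start = as.Date(c("1992-02-01", "1992-06-01", "1992-07-01", "1992-07-01",
                         "1992-08-01", "1992-11-01", "1992-11-01", "1992-11-01",
                         "1992-11-01", "1992-12-01", "1992-12-01")),
  value = c(0.05, 0.42, 1.05, 0.15, 0.61, 0.15, 0.15, 1.65, 5, 0.08, 0.12),
  stringsAsFactors = FALSE
)

# Assumption: the "year" dimension is taken from time_start
df_incomplete$year <- format(df_incomplete$time_start, "%Y")

# S2: sum of the values within each stratum, repeated on every row of the stratum
df_incomplete$sum_incomplete_catch <- ave(
  df_incomplete$value,
  df_incomplete$flag, df_incomplete$year, df_incomplete$gear, df_incomplete$species,
  FUN = sum
)

unique(df_incomplete$sum_incomplete_catch)
```

Because ave() returns a vector of the same length as its input, the monthly rows are kept and only a new column is added, which matches the layout of the table above.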
2. Calculating S1: summing the data from df_input_total by year for each stratum (i.e. combination of flag, year, gear, species). In the example, S1 = 14.
3. Calculating the raising factor RF for each stratum: RF = S1/S2. This is done with the function raise_get_rf. RF is the ratio of the total catch to the catch available in the catch-and-effort dataset. In the example, RF = 14/9.43 = 1.48. This means that (1 / RF * 100 =) 67% of the total catch is available in the catch-and-effort dataset.
At this stage, we get the following table:
flag | time_start | time_end | geographic_identifier | gear | species | value | sum_incomplete_catch | sum_total_catch | RF |
AUS | 1992-02-01 | 1992-03-01 | 235140 | LL | YFT | 0.05 | 9.43 | 14 | 1.48 |
AUS | 1992-06-01 | 1992-07-01 | 230125 | LL | YFT | 0.42 | 9.43 | 14 | 1.48 |
AUS | 1992-07-01 | 1992-08-01 | 235140 | LL | YFT | 1.05 | 9.43 | 14 | 1.48 |
AUS | 1992-07-01 | 1992-08-01 | 240140 | LL | YFT | 0.15 | 9.43 | 14 | 1.48 |
AUS | 1992-08-01 | 1992-09-01 | 240140 | LL | YFT | 0.61 | 9.43 | 14 | 1.48 |
AUS | 1992-11-01 | 1992-12-01 | 230120 | LL | YFT | 0.15 | 9.43 | 14 | 1.48 |
AUS | 1992-11-01 | 1992-12-01 | 230125 | LL | YFT | 0.15 | 9.43 | 14 | 1.48 |
AUS | 1992-11-01 | 1992-12-01 | 235115 | LL | YFT | 1.65 | 9.43 | 14 | 1.48 |
AUS | 1992-11-01 | 1992-12-01 | 235135 | LL | YFT | 5 | 9.43 | 14 | 1.48 |
AUS | 1992-12-01 | 1993-01-01 | 235110 | LL | YFT | 0.08 | 9.43 | 14 | 1.48 |
AUS | 1992-12-01 | 1993-01-01 | 240140 | LL | YFT | 0.12 | 9.43 | 14 | 1.48 |
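The arithmetic of step 3 can be sketched in base R as follows. This is only an illustration of RF = S1/S2 on the example figures, not the code of raise_get_rf; the object names s1, s2, rf and the column perc_available are hypothetical.

```r
# Per-stratum sums typed in from the worked example
s2 <- data.frame(flag = "AUS", year = "1992", gear = "LL", species = "YFT",
                 sum_incomplete_catch = 9.43, stringsAsFactors = FALSE)
s1 <- data.frame(flag = "AUS", year = "1992", gear = "LL", species = "YFT",
                 sum_total_catch = 14, stringsAsFactors = FALSE)

# Join the two sums on the raising dimensions and compute RF = S1 / S2
dims <- c("flag", "year", "gear", "species")
rf <- merge(s2, s1, by = dims)
rf$RF <- rf$sum_total_catch / rf$sum_incomplete_catch

# Share of the total catch that is present in the incomplete dataset, in percent
rf$perc_available <- 100 / rf$RF

round(rf$RF, 2)           # 1.48
round(rf$perc_available)  # 67
```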
4. Raising the df_input_incomplete dataset: multiply each value of the catch-and-effort dataset by the associated RF.
After the raising, the new catch-and-effort dataset is the following:
flag | time_start | time_end | geographic_identifier | gear | species | value |
AUS | 1992-02-01 | 1992-03-01 | 235140 | LL | YFT | 0.074 |
AUS | 1992-06-01 | 1992-07-01 | 230125 | LL | YFT | 0.6216 |
AUS | 1992-07-01 | 1992-08-01 | 235140 | LL | YFT | 1.55 |
AUS | 1992-07-01 | 1992-08-01 | 240140 | LL | YFT | 0.222 |
AUS | 1992-08-01 | 1992-09-01 | 240140 | LL | YFT | 0.9028 |
AUS | 1992-11-01 | 1992-12-01 | 230120 | LL | YFT | 0.222 |
AUS | 1992-11-01 | 1992-12-01 | 230125 | LL | YFT | 0.222 |
AUS | 1992-11-01 | 1992-12-01 | 235115 | LL | YFT | 2.442 |
AUS | 1992-11-01 | 1992-12-01 | 235135 | LL | YFT | 7.4 |
AUS | 1992-12-01 | 1993-01-01 | 235110 | LL | YFT | 0.1184 |
AUS | 1992-12-01 | 1993-01-01 | 240140 | LL | YFT | 0.1776 |
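The multiplication of step 4 can be checked with a few lines of base R. This is a sketch on the example values only; the variable names are hypothetical. The figures in the table above are consistent with the rounded factor 1.48 (for instance 0.05 * 1.48 = 0.074), so both the rounded and the unrounded factor are shown.

```r
# Values of the catch-and-effort example and the total catch S1
value <- c(0.05, 0.42, 1.05, 0.15, 0.61, 0.15, 0.15, 1.65, 5, 0.08, 0.12)
s1 <- 14

rf_rounded <- 1.48            # RF as displayed in the table
rf_exact <- s1 / sum(value)   # RF without rounding

raised_rounded <- value * rf_rounded
raised_exact <- value * rf_exact

# With the unrounded factor, the raised values add up to the total catch
# (up to floating-point error); with the rounded factor they fall slightly short
sum(raised_exact)
sum(raised_rounded)
```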
The parameter threshold_rf makes it possible to remove some data based on the raising factor. In the example, setting threshold_rf = 10 would remove all the data for which RF > 10; in other words, any data for which less than 10 per cent of the total catches are available in the catch-and-effort dataset for the stratum set with x_raising_dimensions. Setting threshold_rf = NULL means that no data are filtered.
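A minimal sketch of such a filter, under stated assumptions, could look like the following. The raising factors are made-up values for three hypothetical strata, the column perc_available is introduced for illustration, and the treatment of a stratum sitting exactly on the threshold (kept here) is an assumption; the function itself may behave differently at the boundary.

```r
# Hypothetical raising factors for three strata (illustrative values only)
rf <- data.frame(species = c("YFT", "BET", "SKJ"),
                 RF = c(1.48, 12.5, 4),
                 stringsAsFactors = FALSE)
threshold_rf <- 10  # percent

# Percentage of the total that is available in the incomplete dataset
rf$perc_available <- 100 / rf$RF

# Keep every stratum when the threshold is NULL; otherwise drop the strata
# whose available share is below the threshold
keep <- if (is.null(threshold_rf)) {
  rep(TRUE, nrow(rf))
} else {
  rf$perc_available >= threshold_rf
}
rf_kept <- rf[keep, ]
rf_kept$species
```

Here the stratum with RF = 12.5 corresponds to 8 per cent of the total being available, so it is the only one dropped.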
The object "stats" of the output list is a data.frame with the following columns. In the list below, "sum" is the sum of the data, i.e. the sum of the column 'value' of the dataset:
sum_df_total: sum extracted from df_input_total
sum_df_incomplete_before_raising: sum extracted from df_input_incomplete before the raising
sum_df_incomplete_after_raising: sum extracted from df_input_incomplete after the raising
sum_df_incomplete_do_not_exist_in_df_total: sum extracted from df_input_incomplete that cannot be raised because the strata exist in df_input_incomplete but not in df_input_total. The raised value for these strata is equal to the raw value.
sum_df_total_do_not_exist_in_df_incomplete: sum extracted from df_input_total for the strata that exist in df_input_total but not in df_input_incomplete
perc_df_incomplete_over_df_total_before_raising: percentage of the data of df_input_total that were available in df_input_incomplete before the raising
perc_df_incomplete_over_df_total_after_raising: percentage of the data of df_input_total that are available in df_input_incomplete after the raising
perc_df_incomplete_do_not_exist_in_df_total: percentage of the data coming from df_input_incomplete for which there is no correspondence in df_input_total (i.e. percentage of the strata that exist in df_input_incomplete but do not exist in df_input_total)
perc_df_total_do_not_exist_in_df_incomplete: percentage of the data coming from df_input_total for which there is no correspondence in df_input_incomplete (i.e. percentage of the strata that exist in df_input_total but do not exist in df_input_incomplete)
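To illustrate the "do not exist" statistics, the sketch below uses a full outer join on made-up per-stratum sums. The data, the object names inc, tot and both, and the denominators used for the percentages are assumptions for illustration; the function may compute these columns differently.

```r
dims <- c("flag", "species")

# Hypothetical per-stratum sums (illustrative values only)
inc <- data.frame(flag = c("AUS", "AUS"), species = c("YFT", "BET"),
                  sum_incomplete = c(9.43, 2), stringsAsFactors = FALSE)
tot <- data.frame(flag = c("AUS", "AUS"), species = c("YFT", "SKJ"),
                  sum_total = c(14, 6), stringsAsFactors = FALSE)

# Full outer join: NA marks a stratum that is missing on one side
both <- merge(inc, tot, by = dims, all = TRUE)

sum_df_incomplete_do_not_exist_in_df_total <-
  sum(both$sum_incomplete[is.na(both$sum_total)])
sum_df_total_do_not_exist_in_df_incomplete <-
  sum(both$sum_total[is.na(both$sum_incomplete)])

# Assumption: percentages are taken over the sum of the originating dataset
perc_df_incomplete_do_not_exist_in_df_total <-
  100 * sum_df_incomplete_do_not_exist_in_df_total / sum(inc$sum_incomplete)
perc_df_total_do_not_exist_in_df_incomplete <-
  100 * sum_df_total_do_not_exist_in_df_incomplete / sum(tot$sum_total)
```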
Usually, for catches: x_raising_dimensions = c("gear","flag","species","year","source_authority","unit") and for efforts: x_raising_dimensions = c("gear","flag","year","source_authority","unit").
A list with the following objects:
"df": data.frame. Representing df_input_incomplete raised to df_input_total
"stats": data.frame. Information regarding the raising process (see Details for additional information).
Other process data: convert_units, create_calendar, create_grid, get_rfmos_datasets_level0, map_codelist, raise_datasets_by_dimension, raise_get_rf, rasterize_geo_timeseries, spatial_curation_downgrade_resolution, spatial_curation_intersect_areas, spatial_curation_reallocate_data, spatial_curation_upgrade_resolution
# Connect to the Tuna atlas database
con <- db_connection_tunaatlas_world()

# Extract the IOTC georeferenced catch time series from the Sardara DB
ind_catch_tunaatlasird_level1 <- extract_dataset(
  con,
  list_metadata_datasets(con, identifier = "indian_ocean_catch_1952_11_01_2016_01_01_tunaatlasIRD_level1")
)
head(ind_catch_tunaatlasird_level1)

# Extract the IOTC total (nominal) catch time series from the Sardara DB
ind_nominal_catch_tunaatlasiotc_level0 <- extract_dataset(
  con,
  list_metadata_datasets(con, identifier = "indian_ocean_nominal_catch_1950_01_01_2015_01_01_tunaatlasIOTC_2017_level0")
)
head(ind_nominal_catch_tunaatlasiotc_level0)

## Raise georeferenced catch to total catch.
## Raise by {gear, flag, species, year, source_authority, unit}

# First calculate the dataset of raising factors
df_rf <- raise_get_rf(
  df_input_incomplete = ind_catch_tunaatlasird_level1,
  df_input_total = ind_nominal_catch_tunaatlasiotc_level0,
  x_raising_dimensions = c("gear", "flag", "species", "year", "source_authority", "unit")
)

# Then raise
ind_catch_tunaatlasird_level2 <- raise_incomplete_dataset_to_total_dataset(
  df_input_incomplete = ind_catch_tunaatlasird_level1,
  df_input_total = ind_nominal_catch_tunaatlasiotc_level0,
  df_rf = df_rf,
  x_raising_dimensions = c("gear", "flag", "species", "year", "source_authority", "unit"),
  threshold_rf = NULL
)
head(ind_catch_tunaatlasird_level2$df)

# Get statistics on the raising process
ind_catch_tunaatlasird_level2$stats

# Close the database connection
dbDisconnect(con)