knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(datapackcommons)
library(tidyverse)

What are the DataPack model and PSNUxIM model or distribution

The DataPack model output is an .rds file with historic target and result data used for building and validating DataPacks. See its usage in datapackr::packDataPack and datapackr::checkAnalytics. Production versions of these models are stored in SharePoint and S3. The DataPack model includes historic data aggregated across mechanisms (including dedupes) to the PSNU level. In the case of results, this means we are aggregating from site and community data.

The PSNUxIM model output is an .rds file with historic target data broken out by mechanism. It is used for populating the PSNUxIM tab of the data pack with initial allocations of PSNU level targets to mechanisms including deduplication.

Recent versions of these files can be found on SharePoint.

Here is a listing of the production support files on S3 as of 2022-05-18. Notice that we only keep the most current production version on S3, and the name remains stable. There is code in the model generation scripts to post to the S3 testing and prod environments.

aws-vault exec datapack-prod -- aws s3 ls prod.pepfar.data.datapack/support_files/

2021-02-04 23:36:17 0

2022-05-10 14:26:18 3533160 datapack_model_data.rds

2022-05-16 21:15:17 4902501 psnuxim_model_data_21.rds

2022-05-16 18:58:52 5991842 psnuxim_model_data_22.rds

Generating a DataPack model

The script is in datapackcommons at data-raw/model_calculations.R. Commented out at the end of the script is valid code for writing the output locally and to S3. Currently the main output of the script is a nested DataPack model, but this isn't what datapackr uses, so we must flatten this model using flattenDataPackModel_21. In the TODO we have an item to change this so that the flattened version is the base model and we stop creating a nested version.

Note that in addition to the functions set up in the script, there are also functions included as part of the package. Ultimately all of the required functions should be moved over to datapackr as proper package functions.

There is also a function diffDataPackModels that allows you to do a diff between two different flattened DataPack models. This is very useful for refactoring because we can compare a full before and after version of the model to make sure code changes don't lead to (unexpected) model output changes.
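To illustrate the idea of a before/after comparison, here is a minimal base R sketch. It is not the implementation of diffDataPackModels, whose signature and output are not shown here; the toy data frames, the PSNU uid, and the choice of key columns are assumptions made for the example.

```r
# Toy "before" and "after" flattened models.
# The psnu_uid value and the reduced set of columns are hypothetical.
before <- data.frame(
  indicator_code = c("HTS_TST.KP.Pos.T_1", "TB_STAT.N.New.Pos.T_1"),
  psnu_uid = c("psnu0000001", "psnu0000001"),
  value = c(10, 20),
  stringsAsFactors = FALSE
)
after <- data.frame(
  indicator_code = c("HTS_TST.KP.Pos.T_1", "TB_STAT.N.New.Pos.T_1"),
  psnu_uid = c("psnu0000001", "psnu0000001"),
  value = c(10, 25),
  stringsAsFactors = FALSE
)

# Full outer join on the key columns so rows present on only one side are kept.
keys <- c("indicator_code", "psnu_uid")
joined <- merge(before, after, by = keys, all = TRUE,
                suffixes = c("_before", "_after"))

# A row counts as unchanged only when both values exist and are equal.
same <- !is.na(joined$value_before) & !is.na(joined$value_after) &
  joined$value_before == joined$value_after
model_diff <- joined[!same, ]
print(model_diff)
```

An empty model_diff after a refactor would suggest the change did not alter the model output for the compared keys.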

After running the script from start to finish (without uncommenting the writing of files), you will have a new version of the DataPack model in an object called cop_data. This is the unflattened version, which has some information useful in debugging, but this information could be created separately from the data file itself.

Generating a PSNUxIM distribution/model

The script is in datapackcommons at data-raw/SnuxImDistMain.R. It is currently possible to produce a PSNUxIM model for COP21 or COP22; the variable cop_year is set at the top of the script. This script also contains code for comparing two versions of the PSNUxIM distribution, but it is inline at the bottom instead of in a function. It also contains commented-out code for sending the file to S3.

The final R object created by this script is called data. When written as an RDS, these are the data used by datapackr to populate the PSNUxIM tab.

Generating configuration files

There are 2 Excel files used to configure the DataPack and PSNUxIM models.

  1. data-raw/model_calculations/model_calculations.xlsx

    This workbook contains two sheets: one with configuration details for pulling the historic data required for the DataPack (data_required sheet), and one with details on the disaggregations required for the DataPack (dimension_item_sets sheet), which may include aggregation or disaggregation compared to what is actually in DATIM. More details on the configuration are provided below.

  2. data-raw/snu_x_im_distribution_configuration/22Tto23TMap.xlsx

    This workbook contains 1 sheet with the configuration details for generating the PSNUxIM distribution. More details on the configuration are provided below.

The three worksheets in these two Excel files are saved in csv format for version control purposes, and these three csv files are converted to .rds files for inclusion in the package by the script data-raw/CreateDataFiles.R. This script includes some checks to ensure the internal consistency of the configuration (e.g. names and uids match), and it also compares the newly generated rds data files with the versions currently in the package. This allows us to ensure any configuration changes are as intended.

Practically, this means that updating the DataPack model and the PSNUxIM distribution model requires three steps.

  1. Update the relevant sheet(s) in data-raw/model_calculations/model_calculations.xlsx or data-raw/snu_x_im_distribution_configuration/22Tto23TMap.xlsx and save the xlsx file.

    NOTE: for next year the distribution file would be named 23Tto24TMap.xlsx if we keep the pattern the same.

  2. Save the updated Excel sheets as individual csv files:

    - data-raw/model_calculations/dimension_item_sets.csv
    - data-raw/model_calculations/data_required.csv
    - data-raw/snu_x_im_distribution_configuration/22Tto23TMap.csv

  3. Run the script that validates the configuration files (csv versions) and creates the RDS files:

    - data-raw/CreateDataFiles.R
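As an illustration of the kind of internal consistency check mentioned above (names and uids matching), the following base R sketch flags uids that appear with more than one name. It is not the code in data-raw/CreateDataFiles.R; the column names item_name and item_uid and the placeholder uids are hypothetical.

```r
# Toy configuration rows; column names and uids are hypothetical placeholders.
config <- data.frame(
  item_name = c("Item A", "Item A", "Item B", "Item B (renamed)"),
  item_uid  = c("uidAAAAAAAA", "uidAAAAAAAA", "uidBBBBBBBB", "uidBBBBBBBB"),
  stringsAsFactors = FALSE
)

# Count the distinct names recorded against each uid.
names_per_uid <- tapply(config$item_name, config$item_uid,
                        function(x) length(unique(x)))

# Any uid with more than one name points to a configuration inconsistency.
inconsistent_uids <- names(names_per_uid)[names_per_uid > 1]
print(inconsistent_uids)
```

Running a check like this on the csv versions before creating the RDS files makes accidental edits to a name or uid visible early.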

Configuring dimension_item_sets

You should familiarize yourself with the DHIS2 API analytics endpoint in order to fully understand the content below.

The data in the DataPack is generally organized with an age and sex, or a key population, in the rows (EID and GEND don't have row disaggregates; age is implied in EID columns). Additionally, the columns often have an implied disaggregation of some kind. For example, HTS_TST.KP.Pos.T_1 is disaggregated by key population status and has an implied disaggregation (analytics filter) on HIV test +. TB_STAT.N.New.Pos.T_1 is disaggregated by age (1-65+) and sex (Male/Female) and has an implied disagg on HIV Test Status (Specific) = Newly Tested Positives (Specific). This is similar for the PSNUxIM distribution.

The dimension item sets configuration is used to specify which disaggregations are used to pull data from the DHIS2 analytics endpoint for the DataPack model and the PSNUxIM distribution. Each column in the DataPack and each unique indicator code in the PSNUxIM distribution can be associated with up to 3 dimension item sets (model sets): age and/or sex and/or other disagg; or KP and/or other disagg.
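To make the link to the analytics endpoint concrete, here is a rough sketch of how a set of dimension items could be turned into a DHIS2 analytics query string. It only uses the general `dimension=<dimension uid>:<item>;<item>` syntax of the analytics API. The helper build_analytics_query and every uid except tIZRQs0FK5P are hypothetical placeholders, and this is not necessarily how datapackcommons builds its calls.

```r
# Hypothetical helper: assemble an analytics URL from a data element, a period,
# an org unit and a named list of extra dimensions (dimension uid -> item uids).
build_analytics_query <- function(base_url, data_element_uid, period,
                                  org_unit_uid, dimensions) {
  dim_params <- vapply(names(dimensions), function(dim_uid) {
    paste0("dimension=", dim_uid, ":",
           paste(dimensions[[dim_uid]], collapse = ";"))
  }, character(1))
  params <- c(paste0("dimension=dx:", data_element_uid),
              paste0("dimension=pe:", period),
              paste0("dimension=ou:", org_unit_uid),
              dim_params)
  paste0(base_url, "api/analytics.json?", paste(params, collapse = "&"))
}

# All uids are placeholders except the "10-14 (Specific)" dimension item.
url <- build_analytics_query(
  base_url = "https://www.datim.org/",
  data_element_uid = "deUid000001",
  period = "2022Oct",
  org_unit_uid = "ouUid000001",
  dimensions = list(dimUid00001 = c("tIZRQs0FK5P", "itemUid0002"))
)
print(url)
```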

An important aspect of the dimension item set configuration is that it allows us to specify special situations for splitting historic data into new disaggregations for the DataPack or aggregating historic data for use in the DataPack.

Dimension item set examples

Standard key population disaggregation 1:1 dimension items to category options

This is the most basic example. We are pulling data from the Key Populations v3 dimension, and we include all of the dimension items under this dimension. As this dimension is actually a category, the category option uids and the dimension item uids match. The weights are all equal to 1 because we aren't aggregating or disaggregating the data further for the DataPack. The model_set here is kp1, so that is the foreign key we will use in the other configuration files to pull data with these disaggregations.

https://www.datim.org/api/dimensions?filter=name:eq:Key%20Populations%20v3&fields=:all,items[name,id]

https://www.datim.org/api/categories?filter=name:eq:Key%20Populations%20v3&fields=name,id,categoryOptions[name,id]

dplyr::filter(datapackcommons::dim_item_sets, model_sets == "kp1") %>% as.data.frame()

option_name = NA

First, note that in all cases where option_name = NA, dim_cop_type = other_disagg. In these cases we are not dealing with explicit age, sex, or kp disaggregates that appear in DataPack rows, but rather with implicit disaggregations that are part of the DataPack indicator/target.

Ages with additional distribution to new ages

Historically the highest age category has been 50+, but with COP22 and COP23 finer age bands are being used, creating 50-54, 55-59, 60-64, and 65+. For the purposes of the DataPack we need to allocate, or distribute, historic data for the 50+ age category to these new age bands. Here is an example of a model set that does this:

dplyr::filter(datapackcommons::dim_item_sets, model_sets == "15-65+") %>% as.data.frame()

https://www.datim.org/api/dimensions?filter=name:eq:Age:%20Cascade%20Age%20bands&fields=:all,items[name,id]

Some important things to note: this dimension does NOT come from a category; it is a category option group set. The dimension items are not necessarily category options, and the names and uids differ from those of category options. For instance, the dimension item is <item name="10-14 (Specific)" id="tIZRQs0FK5P"/>, but the underlying category option is <categoryOption name="10-14" id="jcGQdcpPSJP"/>

https://www.datim.org/api/categoryOptionGroups/tIZRQs0FK5P?fields=:all,categoryOptions[name,id]

Also, the 15-65+ model set does not include all of the dimension items from the Age: Cascade Age bands dimension/category option group set.

In the configuration we can see that the 50+ dimension item appears 4 times in order to allocate the 50+ data to the 4 new age categories. This distribution of data happens in the datapackcommons code. The 50+ data is proportionally distributed to 50-54, 55-59, 60-64, and 65+ with weights .42, .35, .14, and .09 (which sum to 1), respectively.
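The arithmetic behind this weighted split can be sketched in a few lines of base R. The weights come from the configuration described above; the historic value of 200 is made up for illustration, and the actual datapackcommons code may apply the weights differently (for example, with rounding).

```r
# Weights for splitting the historic 50+ value into the finer age bands.
weights <- c("50-54" = 0.42, "55-59" = 0.35, "60-64" = 0.14, "65+" = 0.09)

# The weights must sum to 1 so that no data is gained or lost in the split.
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical historic value reported for the 50+ age band.
value_50_plus <- 200

# Each new band receives its proportional share of the 50+ value.
allocated <- value_50_plus * weights
print(allocated)
```

Because the weights sum to 1, the allocated values always add back up to the original 50+ value.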

Each model_set is a combination of one or more analytics dimensions and dimension items used in a DHIS2 analytics call. Most columns of dimension_item_sets are described in the docs (?datapackcommons::dim_item_sets). The column model_sets in the dimension_item_sets Excel sheet is semicolon-delimited and gets unnested in the package object datapackcommons::dim_item_sets. A model set is the foreign key for the groupings that are used in the data_required and PSNUxIM model configurations to link the configuration to the ages, sexes, kps and other disaggs.
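The unnesting of a semicolon-delimited model_sets column can be sketched as follows. This is a base R illustration rather than the package's own code; the item names and the assignment of items to the kp1 and 15-65+ model sets are hypothetical.

```r
# Toy version of the sheet: one row per dimension item, with a
# semicolon-delimited model_sets column. Item names are hypothetical.
sheet <- data.frame(
  dim_item_name = c("item A", "item B"),
  model_sets = c("kp1;15-65+", "kp1"),
  stringsAsFactors = FALSE
)

# Split each cell on ";" and repeat the row once per model set.
split_sets <- strsplit(sheet$model_sets, ";", fixed = TRUE)
unnested <- data.frame(
  dim_item_name = rep(sheet$dim_item_name, lengths(split_sets)),
  model_sets = trimws(unlist(split_sets)),
  stringsAsFactors = FALSE
)
print(unnested)

# After unnesting, a model set can be used as a simple lookup key.
kp1_items <- unnested[unnested$model_sets == "kp1", "dim_item_name"]
print(kp1_items)
```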

For every column of the DataPack there will be one or more model sets.

Configuring a PSNUxIM distribution

The PSNUxIM distribution provides data for allocating targets set in the main DataPack tabs to mechanisms on the PSNUxIM tab. We use historic data, specifically prior year targets data, to create these allocations. We take the list of targets for the COP year and map those to targets from the prior year. In most cases this is a direct mapping to the same data element from the previous year. In the case of completely new indicators, however, we sometimes need to map to a different indicator from the prior year. If there is really no good historic data to link a new indicator to, we sometimes leave it out of the PSNUxIM distribution.
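The general idea of allocating a PSNU level target to mechanisms in proportion to prior year targets can be sketched as below. This is an illustration under simplifying assumptions (no deduplication, made-up PSNU uids, mechanism codes and values) and is not the logic datapackr actually uses to populate the PSNUxIM tab.

```r
# Hypothetical prior year targets by PSNU and mechanism.
prior <- data.frame(
  psnu_uid = c("psnuA", "psnuA", "psnuB"),
  mechanism_code = c("11111", "22222", "11111"),
  prior_target = c(30, 70, 50),
  stringsAsFactors = FALSE
)

# Each mechanism's share of its PSNU's prior year total.
prior$share <- prior$prior_target /
  ave(prior$prior_target, prior$psnu_uid, FUN = sum)

# Hypothetical new PSNU level targets from the main DataPack tabs.
new_targets <- c(psnuA = 200, psnuB = 80)

# Initial allocation: the new PSNU target split by the prior year shares.
prior$allocated <- prior$share * unname(new_targets[prior$psnu_uid])
print(prior)
```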

So the configuration requires a reference to a target data element from the previous year. We specify the data element using its technical area, num/den, and disaggregation type. Unfortunately I cannot recall exactly why we chose to use these data element groups and group sets instead of using data element uids directly. It is perhaps slightly easier to maintain as is: if the data element changes, it is usually only the disaggregation type that requires updating.

Examples

OVC_SERV.DREAMS.T

OVC_SERV.DREAMS.T broke the pattern of mapping a DataPack target to a single pair of DSD and TA data elements from the previous year.

Configuring a DataPack Model

The DataPack model is used to populate certain columns of the DataPack with historic data from DATIM, usually the target data for the current fiscal year/preceding COP year and the results data from the prior fiscal year. So for COP23 this means the historic data for the most part will be FY23/COP22 targets and FY22 results. The only edge case at the moment is TB_PREV.N.R, which aggregates the last 5 years of data for COP22 (perhaps the prior 6 years for COP23).

In the DataPack, columns that take historic data are categorized as mer/past or datapack/calculation, e.g.

head(dplyr::filter(datapackr::cop22_data_pack_schema,
              (dataset == "mer" & col_type == "past") |
              (dataset == "datapack" & col_type == "calculation")))

Usually an indicator code with T_1 indicates the prior cop year (so T_1 in COP23 = FY23/COP22 targets = 2022Oct period) and R indicates results from the last fiscal year (so R in COP23 indicates FY22 results = 2021Oct period).
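The usual mapping from COP year to historic period can be written as a small helper. The function historic_period is hypothetical and simply encodes the pattern stated above (T_1 is one year back, R is two years back); it does not cover edge cases such as TB_PREV.N.R.

```r
# Hypothetical helper encoding the usual pattern:
#   T_1 -> targets period starting in October of cop_year - 1
#   R   -> results period starting in October of cop_year - 2
historic_period <- function(cop_year, type = c("T_1", "R")) {
  type <- match.arg(type)
  offset <- if (type == "T_1") 1 else 2
  paste0(cop_year - offset, "Oct")
}

t1_period <- historic_period(2023, "T_1")
r_period <- historic_period(2023, "R")
print(c(T_1 = t1_period, R = r_period))
```

For COP23 this returns 2022Oct for T_1 and 2021Oct for R, matching the examples in the text.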

Consider the PMTCT tab of the datapack

Replicate vs distribute We never distribute

We can look at indicators present and missing in both the schema and model:

your_model <- readRDS("...your_model")
binded_model <- dplyr::bind_rows(your_model)

valid_schema_indicators <-
  filter(datapackr::cop24_data_pack_schema,
         (dataset == "mer" & col_type == "past") |
           (dataset == "datapack" & col_type == "calculation")) %>%
  select(indicator_code) %>%
  distinct()

# indicators in schema not in model
dplyr::anti_join(
  valid_schema_indicators,
  binded_model,
  by = "indicator_code"
)

#           indicator_code
# 1 HTS_TST.KP.Pos.Yield.T

# indicators in model not in schema
dplyr::anti_join(
  binded_model %>% select(indicator_code),
  valid_schema_indicators,
  by = "indicator_code"
) %>%
  distinct()

#             indicator_code
# 1: IMPATT.PRIORITY_SNU.T_1

Note that while the code above may show indicators that are missing, this does not mean they need to be added. Check previous years to see whether there is already a reason for a missing indicator. In this case HTS_TST.KP.Pos.Yield.T is noted as an indicator_code we do not currently need to add to the model.

Once we know all indicators are present we can move on to checking the age/sex/kp disaggs. There is a handy function in the package, pivotSchemaCombos, to help us isolate problem areas. Below is an example analysis that extracts all possible schema combos and checks the model to identify ones missing data in the model (results are included as comments because the model and schema may change):

# analysis of age disaggs missing in model data ----
your_model <- readRDS("your model...")
binded_model <- bind_rows(your_model)

# label model data as present when value is neither NA nor 0
binded_model_e <-
  binded_model %>%
  mutate(has_data = !is.na(value) & value != 0)

# pivot schema disaggs
valid_schema_combos <- pivotSchemaCombos(cop_year = 2024)
head(valid_schema_combos)

#  indicator_code               valid_ages valid_sexes valid_kps age_option_uid sex_option_uid kp_option_uid
#   <chr>                        <chr>      <chr>       <chr>     <chr>          <chr>          <chr>
# 1 HTS_TST.Pos.Total_With_HEI.R <01        Female      NA        sMBMO5xAq5T    Z1EnpTPaUfq    NA
# 2 HTS_TST.Pos.Total_With_HEI.R <01        Male        NA        sMBMO5xAq5T    Qn0I5FbKQOA    NA
# 3 HTS_TST.Pos.Total_With_HEI.R 01-09      Female      NA        A9ddhoPWEUn    Z1EnpTPaUfq    NA
# 4 HTS_TST.Pos.Total_With_HEI.R 01-09      Male        NA        A9ddhoPWEUn    Qn0I5FbKQOA    NA
# 5 HTS_TST.Pos.Total_With_HEI.R 10-14      Female      NA        jcGQdcpPSJP    Z1EnpTPaUfq    NA
# 6 HTS_TST.Pos.Total_With_HEI.R 10-14      Male        NA        jcGQdcpPSJP    Qn0I5FbKQOA    NA

#View(valid_schema_combos)

# combos in schema not in the data?
missing_schema_combos <- anti_join(
  valid_schema_combos,
  binded_model_e %>%
    filter(has_data == TRUE) %>%
    select(-value, -psnu_uid, -period) %>%
    distinct()
)

Using the pivoted schema combos we can run an anti_join against the model data that is valid to see what combos are missing from the model output.

Key Files/Directories

catalog <- tibble::tribble(~name, ~short_description,
                file.path("data-raw", "model_calculations.R"),
                "Script for generating datapack model",
                file.path("data-raw", "SnuxImDistMain.R"),
                "Script for generating PSNUxIM models")
knitr::kable(catalog)

Automated Reporting

Automated DataPack and PSNUxIM model jobs were created on RStudio Connect with the intention of having models produced regularly during COP season as imports become regular. Each automated report sources its respective model production script and produces a report on differences compared to the latest production model in the testing S3 bucket, e.g. support_files/datapack_model_data.rds. These reports are available here:

  1. https://rstudio-connect.testing.ap.datim.org/psnuxim_model_report/
  2. https://rstudio-connect.testing.ap.datim.org/datapack_model_report/

To make minor changes or edits to the reports and automation, edit datapack_model_job.Rmd or psnuxim_model_job.Rmd. More in-depth changes must be made to the original scripts they source: model_calculations.R or SnuxImDistMain.R. Once all your changes are made, approved and in master, you can go to RStudio Workbench and, as an authorized collaborator, republish the report via the blue publish button. The rsconnect DCF files in the rsconnect folder ensure that publishing targets the same report on RStudio Connect.

TODO

Short term
later

Thoughts and ideas



pepfar-datim/data-pack-commons documentation built on April 26, 2024, 12:09 a.m.