knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This article provides examples and explanations of computations that can be done with this package.
In the previous articles we described how to get set up with all of the required datasets.
To start, we will load the package and create a variable that points to the data directory that we have set up.
library(champsmortality) data_dir <- file.path(system.file(package = "champsmortality"), "testdata")
Note that this directory fully populated with synthetic data and comes with the package for the purposes of documentation and examples. This package is most useful with real data obtainable from CHAMPS. As such, keep in mind that the results of the examples in this documentation are not meaningful.
We can read the all the datasets in and validate their structure with the following:
d <- read_and_validate_data(data_dir)
The validation ensures that all of the variables required to perform the calculations are present along with additional checks for the content of some of the variables. If there are any issues, you will see an error message and will need to correct the error before being able to use the package with your data.
A function, process_data()
takes the data that has been read and joins it together to create a dataset ready for analysis.
dd <- process_data(d, start_year = 2017, end_year = 2020)
It is convenient to view a list of all sites and catchments that are available in the data. A simple utiltiy function, get_site_info()
can provide this:
get_site_info(dd)
Computations with this data have the goal of finding adjusted mortality fractions and rates for a given condition found in the causal chain. As you will see, these can be specified by either using the condition name or a regular expression indicating ICD10 codes that indicate the condition.
A convenience function that lists all available conditions in the data is provided, valid_conditions()
:
valid_conditions(dd)
This searches the CHAMPS data and finds all unique condition values found anywhere in the causal chain. A ranking is also provided where a higher ranking indicates that the condition is found more frequently in the data than a condition with a lower ranking.
The main function of this package computes factor-adjusted mortality rates and fractions for a specified set of sites and catchments. More details about the methodology can be found in this article.
The function is get_rates_and_fractions()
. It takes several parameters, many with defaults, and details about these parameters can be found in the function reference on this site. This article will cover many examples that illustrate all of the different ways this function can be used.
Below is a very simple example for getting rates and fractions for the condition "Lower respiratory infections":
graf <- get_rates_and_fractions( dd, condition = "Lower respiratory infections", cond_name_short = "LRI" )
This applies the calculations to each site independently. For each site, data for all catchments are combined. To get results for individual catchments, you can call the function with specific catchments using the catchments
argument. The main arguments are the input data, dd
and the condition. Here we specify the optional parameter cond_name_short
to be "LRI". This is used in some of the tabular and graphical outputs.
As seen in the printed output, some sites have no adjustment variables while others are adjusted on location. For site "S7", both "location" and "va" were significant but we can only adjust for one variable or two variables if one of them is "age", so just "location" is chosen.
Note: In the previous article's discussion of the DSS data, we highlighted that because the data was collected in a way that crosses age with all of the other factors, but does not cross the other factors with each other, we can only adjust for one factor or age + another factor. If there are more than two candidate factors for adjustment, precedence is given to factors according to the following ranking:
The output of this function is a named list for each site, and contains many pieces of information for each site including underlying statistics that went into the calculations.
Note: Great care has gone into validating the outputs of the functions in this package, but we strongly encourage the user to examine the underlying statistics and assumptions to ensure the results are correct.
Let's examine at a high level what is returned by this function for site S6:
str(graf$S6, max.level = 1, give.attr = FALSE)
The first few elements are simply documenting the input parameters that went into the calculations. Specifically:
graf$S6$inputs
Many of these inputs are defaults that we didn't specify.
The elements mits
and cond
are tables that capture information that is used in deciding whether any factors should be adjusted for. They contain one row for each factor, e.g.:
graf$S6$mits
Each row contains information that went into the calculation of whether the factor is statistically significantly associated with MITS consent.
The table
variable contains nested tables with additional details. For example, see the table associated with the factor age
:
graf$S6$mits$table[[1]]
We see counts of MITS vs. non-MITS broken down by age. A chi-square test is carried out on this table to determine a p-value. If a factor is significant and has sufficiently non-missing data (default is 20%), it is a candidate for adjustment.
The pop_mits
element reports statistics of number of stillbirths and under-5 deaths plus stillbirths for each catchment in the site.
graf$S6$pop_mits
The output can_use_dss
specifies whether the site consists of all-DSS sites and helps determine how to combine statistics across sites, as described in more detail here.
The outputs acMR_*
provide the all-cause mortality rate from the DSS and DHS data sources that were used in the calculations.
The most interesting outputs are the rates and fractions:
graf$S6$rate graf$S6$frac
The additional outputs rate_*
provide additional information about what went into the rate and fraction calculations.
There are other noteworthy inputs to get_rates_and_fractions()
. For full documentation you can refer to the function reference on this site.
Suppose we want to calculate rates and fractions for just site "S6" for the condition malaria, not in the causal chain. In this case, we don't want to consider data from stillbirths:
res <- get_rates_and_fractions( dd, sites = "S6", causal_chain = FALSE, condition = "Malaria", factor_groups = list(age = list( Neonate = "Neonate", Infant = "Infant", Child = "Child" )) )
Here, to specify that we are not looking in the causal chain, we set causal_chain = FALSE
(default is TRUE
).
To specify that we do not want to include stillbirths, in the previous example we provided an argument factor_groups
. This argument is taken as a named list, with each possible entry relating to one of the factors. Inside the list, a mapping can be made with the left-hand-side indicating the name of the new factor level and the right-hand-side indicating what levels of the original factor to include in that calculation.
For example, we could specify that we want to define a new breakdown of age with one category being "Infant" that includes "Neonate" and "Infant" and another, "Child" that maps to the existing category "Child". Additionally, suppose we want to break education down into two groups, "Lower" - comprised of "None" and "Primary", and "Upper" - comprised of "Secondary" and "Tertiary". We could do this by specifying the following as the input for factor_groups
:
list( age = list( Infant = c("Neonate", "Infant"), Child = "Child" ), education = list( "Lower" = c("None", "Primary"), "Upper" = c("Secondary", "Tertiary") ) )
Such groupings can be useful if dealing with a condition that has very low counts.
If a category of one of the factors is not provided in the specification, cases with that factor level are excluded from the analysis. For example, in the previous examples, the calculations will not include stillbirths.
In this example, we are only looking at neonates:
npbc_cc <- get_rates_and_fractions(dd, sites = "S1", condition = "Neonatal preterm birth complications", cond_name = "NPBC", causal_chain = TRUE, factor_groups = list(age = list(Neonate = "Neonate")) )
In the previous examples, results have been applied to all catchments within each site. If we wish to perform the initial analysis but only obtain estimates for one catchment, "C1" of site "S6":
graf <- get_rates_and_fractions( dd, sites = "S6", catchments = "C1", condition = "Lower respiratory infections", cond_name_short = "LRI" )
Suppose we want to obtain results for just a subset of sites and catchments:
graf <- get_rates_and_fractions( dd, sites = c("S5", "S6"), catchments = c("C1", "C4", "C5"), condition = "Lower respiratory infections", cond_name_short = "LRI" )
Suppose you have calculated rates and fractions and examined the underlying data and determined that even though a factor wasn't significant, it is close enough and you want to force the computations to adjust for the factor, you can use the adjust_vars_override
argument. For example, suppose we want to force one of our previous examples to adjust on "age" and "location":
graf <- get_rates_and_fractions( dd, sites = "S6", catchments = "C1", condition = "Lower respiratory infections", cond_name_short = "LRI", adjust_vars_override = c("age", "location") )
Conditions can be specified by using a valid value from valid_conditions(dd)
or through a regular expression specifying ICD10 codes that define the condition, or both.
If causal_chain = TRUE
, the condition of ICD10 codes are checked against underlying, immediate, and morbid conditions. If FALSE
, just the underlying is checked.
For example, to get statistics for neural tube defects, we can specify a search for all cases with ICD10 codes that begin with "Q00", "Q01", or "Q05". To specify this as a regular expression, we can provide a string "^Q00|^Q01|^Q05". Here, "^" means "starts-with", and "|" means "or".
ntd_cc <- get_rates_and_fractions( dd, icd10_regex = "^Q00|^Q01|^Q05", cond_name = "NTD+CBD" )
If both condition
and icd10_regex
are provided, they are both searched for and all cases that meet either the condition or the ICD10 specification are included. So for example if we wanted to look at rates and fractions for cases where the underlying cause is either the condition "Congenital birth defects" or NTD, we can do the following:
ntd_cbd_cc <- get_rates_and_fractions( dd, condition = "Congenital birth defects", icd10_regex = "^Q00|^Q01|^Q05", cond_name = "NTD+CBD" )
Note that you can also provide multiple conditions, in which case the calculation will be for cases that have any of the specified conditions:
sep_uc <- get_rates_and_fractions( dd, sites = "S1", causal_chain = FALSE, condition = c("Sepsis", "Neonatal sepsis"), cond_name = "Sepsis" )
We can also calculate rates and fractions for maternal conditions. A list of valid maternal conditions can be obtained with the following:
valid_maternal_conditions(dd)
Suppose we want to calculate statistics for "Maternal hypertension". To do this, we simply provide this as the condition
and specify maternal = TRUE
. This lets the function know to look in the maternal conditions. Note that in this case, the value of causal_chain
does not apply and is ignored.
htn_s1 <- get_rates_and_fractions(dd, sites = "S1", condition = "Maternal hypertension", maternal = TRUE, cond_name = "HTN" )
In the case that you would like to compute rates and fractions for a large number of site/catchment/condition/etc. combinations, you can use a function batch_rates_and_fractions()
. This takes an input csv such as the following:
inputs_csv <- tempfile(fileext = ".csv") writeLines( c( "site,catchment,age,condition,icd10_regex,causal_chain,maternal", "S6,C1;C2,Neonate;Infant;Child,Perinatal asphyxia/hypoxia,,TRUE,FALSE", "S6,C1,Stillbirth,Congenital birth defects,,FALSE,FALSE" ), inputs_csv )
This identifies two scenarios we want to run, the first being Perinatal asphyxia/hypoxia for age groups "Neonate", "Infant", and "Child" in the causal chain for site "S6" and only its catchments "C1" and "C2". Note that when an input has multiple values, we concatenate them with a semicolon.
The function batch_rates_and_fractions()
takes these inputs and runs them through get_rates_and_fractions()
.
bat1 <- batch_rates_and_fractions(dd, inputs_csv)
The output is a list of results of the type we have seen previously.
The csv approach can be nice for succinctly identifying a large number of scenarios to run, particularly if we want to get rates and fractions for each individual catchment of several sites, or if we want to look at several conditions, such as the top 10.
inputs2 <- list( list( site = "S6", catchment = c("C1", "C2"), age = c("Neonate", "Infant", "Child"), condition = "Perinatal asphyxia/hypoxia", icd10_regex = NULL, causal_chain = TRUE, maternal = FALSE ), list( site = "S6", catchment = "C1", age = "Stillbirth", condition = "Congenital birth defects", icd10_regex = NULL, causal_chain = FALSE, maternal = FALSE ) ) bat2 <- batch_rates_and_fractions(dd, inputs2)
A drawback of the batch processing approach is that we do not quite have full control over all input parameters, such as custom factor groupings, etc. One could write their own custom script to iterate through scenarios in the rare occasions that more detailed specification is required.
After we have processed several scenarios, it can be useful to conslidate the output into a simple table with one row per scenario and all of the relevant statistics. This can be done with a function rates_fracs_to_df()
.
For example, to turn our previous batch processing result into a table, we can do the following
tbl <- rates_fracs_to_df(bat1) dplyr::glimpse(tbl)
Note that in addition to results of batch processing, we can also apply this function to the results of get_rates_and_fractions()
.
For example, in our first example of this article, we calculated LRI statistics for every site and catchment and stored the result in an object named graf
.
We can consolidate this output with the following:
rates_fracs_to_df(graf)
The next article provides examples functions available in this package for creating nicely-formatted tables and figures of the outputs we have learned to create in this article.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.