knitr::opts_chunk$set(echo = FALSE,message=FALSE, warning=FALSE, fig.retina=8)
Determine the accuracy of machine learning models for predicting if apnea during nurse-administered procedural sedation will be prolonged (>30 seconds).
Secondary analysis of an observational study.
A selection of candidate models were evaluated. Out of sample accuracy was calculated using 10-fold cross-validation. Decision analysis was used to decide if the models were better, in terms of false positive and negatives, than alternative alarm management strategies, being: an aggressive approach (alarm after 15 seconds); or a conservative approach (alarm after 30 seconds).
A total of 384 apneas longer than 15 seconds were observed in 61 of the 102 participants. Nearly half of the apneas (n=180) were prolonged. The random forest had the best discrimination (AUROC 0.66) and calibration. Net benefit associated with a random forest model exceeded the aggressive alarm management strategy but was lower than the conservative alarm management strategy.
Keywords
Procedural sedation and analgesia; conscious sedation; nursing; informatics; patient safety; machine learning; capnography; anesthesia
\pagebreak
With the recent increase in electronic monitoring devices in the hospital setting, alarm fatigue has become a serious problem that impacts patient safety and nursing care.[@lewandowska2020impact] Alarm fatigue is caused by exposure to excessive and frequent device alarms and consequently becoming desensitized to them. Alarm fatigue has been linked to patient deaths resulting from clinicians becoming desensitized to alarms leading to delayed responses to clinical deterioration.[@chopra2014redesigning] One of the sources of alarms is the capnography device that is used to measure and monitor patients’ ventilation.
A capnography waveform displays the level of expired carbon dioxide (CO~2~) over time to show changes in concentrations throughout the respiratory cycle. Capnography waveform abnormalities assist in detecting and diagnosing specific conditions, such as partial airway obstruction and apnea. For this reason, implementing capnography into practice for respiratory monitoring is considered a high priority to improve patient safety by leading authorities, including national and international professional organizations for anesthesiology in Canada, the United States, and Europe.[@apfelbaum2018; @hinkelbeinEuropeanSocietyAnaesthesiology2018; @Dobson2018] It is commonly used for nurse-administered procedural sedation, [@Conway_2014; @conway2013; @conway2013a] including in the interventional radiology setting.[@pella2018systematic; @brast2016capnography; @long2016capnography; @mafeld2020avoiding; @werthman2020administration]
Deciphering which capnography waveform abnormalities deserve intervention (and therefore alarms to signal the event to clinicians) from those that do not is an essential step towards successfully implementing this technology into practice. For example, triggering alarms after short periods of apnea leads to frequent interruptions and potentially increases the risk of alarm fatigue. Conversely, only intervening once an apneic period reaches a longer threshold negates capnography's potential benefits on patient safety by improving ventilation. In practice, two alternative strategies for capnography alarm management are typically used. The 'aggressive' strategy is for alarms to be triggered after short periods of apnea (e.g. 15 seconds). The 'conservative' approach triggers the alarm only once the patient has been apneic for a prolonged period (e.g. 30 seconds). Preferences for the aggressive or conservative alarm threshold will be influenced by many factors, including the rate of oxygen supplmentation. The duration of time between the onset of apnea to hypoxemia increases with higher flows of oxygen.[@wong2019high]
It is possible that capnography alarm management may be improved by using machine learning to create a ‘smart alarm’ that can alert clinicians for apneic events that are predicted to be prolonged. Such an approach aligns with a call from The Society for Critical Care Medicine Alarm and Alert Fatigue Task Force, that machine learning techniques should be used to advance the quality of alerts that clinicians receive and to individualize alert delivery based on clinician response characteristics, such as alert frequency and severity.[@winters2018society]
For the aggressive strategy, only triggering an alarm for apneas predicted to be prolonged would reduce the total alarm burden and potentially reduce the risk of alarm fatigue. The downside of applying a machine learning approach to the aggressive apnea treatment strategy would be that some prolonged apneas may not receive early intervention if the model incorrectly predicted that the apnea will not last for >30 seconds (i.e. the false negatives). For the conservative strategy, triggering an alarm at the 15-second timepoint for apneas predicted to be prolonged could reduce the total time of apnea, as treatment could be initiated earlier. The downside of applying a machine learning approach to the conservative apnea treatment strategy would be the potential to increase the total alarm burden if the ratio of false positives (predicting the apnea will last for >30 seconds, but it does not) to true positives (correctly predicting at the 15-second time point that the apnea will last for >30 seconds) is high. This study aimed to determine the accuracy of machine learning models for predicting at the 15-second timepoint if a period of apnea will persist for 30 seconds or more. This information would help to determine whether operationalizing these predictions in practice as alarm triggers would be beneficial.
A secondary analysis of a prospective observational study was undertaken. The primary aim of the observational study was to identify common patterns in capnography waveform abnormalities and factors that influence these patterns. Results for the primary aim of the observational study are reported elsewhere.[@conway2019sequence]
All participants provided written informed consent and the study was approved by human research ethics committees (UCH HREC 1614; SVHAC HREC 16/26; QUT 1600000641).
The prediction goal was to classify apneic events at the 15-second timepoint as either short (i.e. terminated prior to 30 seconds) or prolonged (persisted for 30 seconds or longer). The prediction algorithm was compared against typical default alarm settings for capnography monitors.
Participants in the study were consecutive adult patients scheduled to undergo an elective procedure in the cardiac catheterization laboratory with moderate sedation. Patients with severe cognitive impairment (due to inability to provide informed consent) or unable to understand and speak English (if an interpreter was unavailable) were excluded. Data collection was performed at two urban private hospitals in Australia.
The sedation regimen used for patients included in this study comprised bolus doses of intravenous midazolam and fentanyl. Sedation was administered by nurses who were trained in advanced life support. Routine clinical monitoring included continuous cardiac rhythm and oxygen saturation monitoring as well as non-invasive blood pressure measurements every 5-10 minutes. The Respironics LoFlo sidestream CO~2~ sensor was used for capnography monitoring. A CO~2~ sampling cannula was inserted into the sideport of an oxygen face-mask or was integrated as a separate line for nasal cannulas. The capnography waveform was displayed on the main physiological monitoring screen. A default ‘No breaths detected’ alert was triggered for apnea, but no other audible or visual alarms were set for the capnography monitor. No restrictions or specific instructions were provided to clinicians regarding the detection of capnography waveform abnormalities as part of the research protocol because the study used an observational design.
Data were collected from August 2016 to May 2018. Demographic data and clinical characteristics were collected from medical records or directly from participants prior to procedures. Intra-procedural data were collected in real-time by the researcher who was present in the procedure room. Direct observation of the participant was required to record the timing of sedation administrations and any interventions applied by sedation providers.
Several raw demographic (age, sex) and clinical (ASA score, diagnosis of sleep apnea, body mass index, dose and type of sedation and analgesia administered) variables were used as predictors. Features related to sedation dosing used as predictors in the model were the total dose of sedation and number of doses of sedation administered, time since first sedation and the time since the previous dose of sedation. Other features were extracted from the capnography waveform to use as predictors, such as the previous respiratory state (defined as either normal breathing or abnormal), duration of the previous apneic event, time since the previous apneic event, and the total number of apneic events. A total of 18 predictor variables were used. No imputation of missing data was performed. No imputation of missing data was performed.
Analyses were performed using R version 4.0.3.[@rcore2021] Data as well as details about how to access the code and a reproducible computing environment required to verify the results is available.[@conway2021code; @conway2021data]
We selected several candidate models to evaluate, including a random forest model, generalized linear model (logistic regression), lasso regression, ridge regression, and the XGBoost model. Out of sample accuracy of the models was calculated using 10-fold cross-validation. Many participants in the study contributed multiple apneic events to the dataset used for modeling. To take this dependency into account, we ensured that apneic events from individual participants were not included in both the training and testing partitions of the 10-fold cross-validation process. Pre-processing steps included normalizing numeric predictors and using an interaction term for the duration of the previous respiratory state and the total number of apneic events. Discriminative ability of the models was compared using the area under receiver operating characteristics curve (AUROC) as well as by plotting sensitivity, specificity, positive predictive values and negative predictive values (termed a threshold performance plot). A calibration plot with a loess smoother was used to assess calibration.[@van2016calibration] The runway
package was used to create the plots.[@singh2021]
We used the net benefit decision analytic measure to assist with deciding whether using the models in practice would lead to better outcomes on average than the default capnography alarm management strategies currently in place. The default strategies are: 1) the aggressive approach, which involves triggering an alarm after brief periods of apnea (typically 15 seconds); and 2) the conservative approach, which involves triggering an alarm for only prolonged periods of apnea (typically 30 seconds). The net benefit is calculated using the formula:
$$ NB = \frac{TP}{N} - \frac{FP}{N} \times \frac{Pt}{1-Pt} $$
Where $NB$ is the net benefit, $TP$ is the total number of true positives (apneic event at 15 seconds which was predicted to be prolonged and did persist for >30 seconds), $FP$ is the total number of false positives (apneic events at 15 seconds that was predicted to be prolonged but did not persist for >30 seconds), $N$ is the sample size (number of predictions made), and $Pt$ is the probability threshold.
This formula essentially transforms the total number of true positives and false positives into a standardized scale, weighted by the relative harm of a false-positive result.[@vickers2006decision] For example, a net benefit of 0.07 means the net benefit of using the model would be seven true positives from every 100 predictions from the model. This net benefit can result from any combination of true positives and false positives.[@vickers2016net] A probability threshold of 0.5 indicates that avoiding a false positive is as important to a clinician as identifying a true positive. Preferences for probability thresholds below 0.5 are weighted such that identifying a true positive is more valuable than avoiding a false positive. Preferences for probability thresholds above 0.5 are weighted such that avoiding a false positive is more valuable than identifying a true positive. For example, for a $Pt$ of 0.75, the 'value' of a false positive is three true positives (0.75/0.25 = 3). In other words, to create a net benefit from using the model at this probability threshold, there must be more than three true positives for every false positive prediction made from the model. Conversely, for a $Pt$ of 0.25, the 'value' of a false positive is weighted far lower, at only a third of a true positive (0.25/0.75 = 0.33). This means that a net benefit would be achieved if there was more than one true positive for every three false positives. Decision curves can be interpreted such that the strategy with the highest net benefit at each probability threshold has the highest clinical value.[@vickers2016net]
We created a decision curve to plot net benefits across a range of probability thresholds for the aggressive alarm management strategy (alarm triggered at 15 seconds of apnea) and the conservative alarm management strategy (alarm triggered at 30 seconds of apnea). The decision curve takes into account the full range of reasonable clinician preferences for the point at which an alarm should be triggered to signal that the patient is apneic. We tested thresholds in the range from 0.3 to 0.5 for the aggressive alarm management strategy. In a practical sense, this means that we decided that all clinicians who usually use the aggressive alarm strategy would not ever accept a probability for prolonged apnea lower than 0.3 as a useful alarm trigger because there would be little difference between this strategy and just setting the alarm for all apneas. We also decided that all clinicians would always consider that an alarm is indicated for the aggressive alarm management strategy if the probability of prolonged apnea was higher than 0.5. A range of values was used because these probability thresholds can be interpreted as value preferences that individual clinicians may reasonably choose in the context of clinical practice. For example, a clinician who is more 'risk-averse' may elect for a more conservative probability threshold (closer to 0.3). Individual participant characteristics will also influence clinicians' decisions about probability thresholds. A clinician may elect to intervene when the probability for prolonged apnea is 0.3 for an elderly patient with multiple comorbidities but not for a young patient who may be more likely to be able to tolerate longer periods of apnea. For the conservative alarm management strategy comparison, we chose to plot the range of probability thresholds from 0.7 to 0.8. Higher values were chosen because the number of false positives would be an important consideration for clinicians already using a conservative alarm management approach.
A total of 384 apneic events of at least 15 seconds duration from 61 of the 102 patients who participated in this observational study were included in the present analysis. A summary of participant characteristics is presented in Table \@ref(tab:tab1). Nearly half of the (n=180) apneic events were prolonged (i.e. >30 seconds).
targets::tar_read(summary_table)
A plot of the area under the receiver operating characteristics curves for the models using predictions from 10-fold cross-validation is presented in Figure \@ref(fig:roc).
The random forest had the best discriminatory power of the models, with a mean AUROC score of 0.66 (standard error 0.03). A threshold performance plot, which summarises the discriminatory power for the models, including values for sensitivity, specificity, positive predictive value, and negative predictive value across all probability thresholds, is presented in Figure \@ref(fig:tpp).
The random forest model had the best calibration. It approximated observed risk at moderate (0.5) to high (0.8) thresholds (Figure \@ref(fig:cal)), although the risk was overestimated at very low thresholds and slightly underestimated between 0.4 to 0.5. Other models severely overestimated risk at low probability thresholds and underestimated risk at high probability thresholds.
As the random forest performed the best in terms of discrimination and calibration, we chose this model to evaluate using decision curve analysis. The net benefit associated with the random forest model exceeded the aggressive alarm management strategy across all probability thresholds from the range of 0.3 to 0.5 (Figure \@ref(fig:dca)). The interpretation is that the best clinical outcome would be achieved for clinicians who are willing to initiate an intervention to treat apnea at the 15-second mark if the probability of it being prolonged was more than 40% by using the random forest model. The net benefit associated with the random forest model was lower than the conservative alarm management strategy across all probability thresholds from the range of 0.7 to 0.8 (Figure \@ref(fig:dca)).
In this study, we found that a random forest model had the best discriminative ability and calibration for predicting if an apneic event during nurse-administered procedural sedation would be prolonged. However, it should be noted that the accuracy of this random forest model was still quite low (AUROC = 0.66). Additional research should be undertaken with larger sample sizes to validate our initially promising findings.
Results from prior research indicated that using information about the history of previous respiratory states may be a promising solution for predicting durations of apneic periods. It was found in a study of capnography waveform abnormalities during nurse-administered sedation that the hazard for apnea increased two-fold (hazard ratio 2.14, 95% CI 1.75 - 2.62) when a patient was in a state of hypoventilation (defined as >10% reduction in ETCO~2~ from baseline).[@conway2019pre] The risk of apnea also increased with each additional sedation dose (hazard ratio 2.86; 95% CI 2.15 - 3.81).[@conway2019pre] Results from an earlier study in a different population also supported the observations that the onset of apneic periods during sedation are associated with the history of previous respiratory states. Krauss and colleagues[@krauss2016] used survival analysis to model the time to first apneic events in a sample of 312 patients undergoing procedural sedation with propofol or ketamine in the emergency department. It was identified that having an abnormal ETCO~2~ measurement 30, 60 and 90 seconds prior to an apneic event increased risk for apnea (hazard ratios with 95% CI for the different lag durations of 2.45 (1.63 - 3.69), 1.88 (1.21 - 2.92) and 2.06 (1.36 - 3.11), respectively). In our study, ee leveraged information about the associations between apneic events and the history of previous respiratory states by building a predictive model using a machine learning approach. Features included in the models we tested were the previous respiratory state, duration of time in the previous respiratory state, number of previous apneic events and the duration of the previous apneic event.
Many prediction modeling studies focused on predicting clinical outcomes have yielded similarly low AUROC scores. For example, a recent study of the predictive ability of vital sign parameters for detection of clinical deterioration in subacute care patients yielded an AUROC score of 0.57.[@considine2020vital] When presented with a model with low AUC scores, decision curve analysis can help elicit whether the model is "good enough" to use in practice. Based on our results, nurses currently using the conservative alarm management strategy (alarm only at 30 seconds of apnea) who are willing to 'value' a false positive about 2-3 times more than a true positive would not derive an overall net benefit from using the random forest model as a trigger for apnea alarms. This is because using the random forest model would produce a worse outcome, in terms of the balance between true positives and false positives, for detecting if an apneic event will be prolonged, than a default strategy of waiting to trigger the alarm until the 30-second threshold.
Conversely, nurses currently using the aggressive capnography alarm management strategy (alarm triggered at 15 seconds of apnea) who are willing to 'value' a false positive about two-thirds the importance of a true positive would derive an overall net benefit from using the random forest model as a trigger for apnea alarms. Using the random forest model as an additional input for an alarm trigger would reduce the total alarm burden and could be considered as an option for implementation into practice. To operationalize these predictions into capnography monitors, partnerships with industry would be required because monitor functionality would need to be adapted to facilitate input of the data required to calculate the predictions.[@glasgow2018nurse] These would include patient characteristics and sedation dosing. Integrating predictive models into alarm management strategies for respiratory monitoring devices is also indicated in other contexts. For example, a recent study found that opioid-induced respiratory depression during recovery from anesthesia can be accurately predicted using a machine learning approach.[@jungquist2019identifying] In addition, user-centered design considerations, such as how the predictions should be communicated to nurses responsible for decision-making, would be important avenues for further research prior to implementation.[@risling2020advancing]
This study used decision curve analysis as a way to evaluate the potential clinical impact that using the model as input for capnography alarm management would have on the number of alarms triggered (i.e. false positives/negatives). However, as with any intervention in healthcare, efficacy needs to be assessed prior to broader implementation. Efficacy in this context would be whether using the model as input for capnography alarm management improves patient safety. The gold standard approach for such an evaluation is a randomized controlled trial. Randomized controlled trials testing alarm conditions that have integrated predictions from machine learning models have been conducted previously in similar contexts, such as intra-operative blood pressure management.[@maheshwari2020hypotension; @wijnberge2020effect]
The finding that the model produced an overall net benefit against the aggressive capnography alarm management strategy but not the conservative approach is noteworthy. Further research with larger sample sizes is needed to increase the predictive power of models aimed at predicting the duration of apnea. Such research is warranted because triggering an alarm after 30 seconds of apnea that would have self-resolved without clinical intervention only 5 seconds later is just as clinically inconsequential as triggering an alert after 5 seconds of apnea that would similarly have resolved after a short period of time. In both of these circumstances, there would not have been enough time for the clinician's intervention to take effect. Yet, presumably in an attempt to reduce alarm burden, many capnography monitoring devices alarm settings default to triggering after apnea has been observed for 30 seconds. An ideal alternative to the conservative alarm strategy would be for clinicians to receive capnography monitor alarms as early as possible in the course of an apneic event, but only if the event will be prolonged for enough time for a clinical intervention to be implemented and exert its intended effect. Results from the present study did not achieve this goal. Yet, previous research indicates that finding such a solution may be worthwhile. An analysis of half a million patients found that respiratory compromise in interventional radiology procedures performed with moderate sedation contributed to worse clinical outcomes and higher costs.[@urman2019impact]
Although the number of apneas included in the models was relatively high, these were contributed by a small number of patients. We used cross-validation to minimize the possibility of overfitting. It should also be noted that the data used in this analysis were produced from an observational study conducted at two hospitals that used a convenience sampling approach. Therefore, the possibility of selection bias should be considered. The context in which the study was conducted should also be considered in regard to external validity. Participants in the study were patients undergoing procedures in a cardiac catheterization laboratory where small, bolus doses of midazolam and fentanyl were used for sedation. Other procedural sedation contexts may use different doses of sedation and the types of medications. It would not be expected that the results of this study could be generalized to those other contexts. A further limitation is that, due to the observational nature of the research design, clinicians were not blinded to capnography measurements. It is possible that interventions implemented by clinicians during the 0 to 30-second apneic period influenced the duration of apnea. However, this mimics real-world practice, in that interventions may be implemented at a clinicians' discretion where no alarm conditions have been met. It should also be noted that only 25% of the sample had sleep apnea, which was one of the predictors included in the model. Due to the small sample size, the dataset used to train the model would have contained only a small portion of patients with sleep apnea and therefore it may not be generalizable to this population. Further research with larger sample sizes is required for confirmation.
We evaluated several candidate models to determine their accuracy in predicting if an apnea at 15 seconds would be prolonged for more than 30 seconds. The random forest model provided the best discrimination and calibration. Decision curve analysis indicated that using the random forest model would lead to a better outcome for capnography alarm management compared to an aggressive strategy where alarms are triggered after 15 seconds of apnea. The model would not be superior to the conservative strategy, where alarms are only triggered after 30 seconds.
\pagebreak
A.C. received support from a National Health and Medical Research Council Early Career Fellowship (APP1091657) and the Bertha Rosenstadt Faculty Small Research Grant from the Lawrence S. Bloomberg Faculty of Nursing at the University of Toronto. This study received funding from the Wesley Medical Research (Project number: 2015-31). Role of the funding source: The funders had no role in study design; in the collection, analysis and interpretation of data; in the writing of the report; and in the decision to submit the article for publication.
None
AUROC: Area under the receiver operating characteristics curve ETCO~2~: End-tidal carbon dioxide CO~2~: Carbon dioxide
\pagebreak
::: {#refs} :::
\pagebreak
Fig. 1 Area under receiver operating characteristics curve
Fig. 2 Threshold performance plot for all models evaluated
Fig. 3 Calibration plot for all models evaluated
Fig. 4 Decision curve analysis plots. Panel A is the comparison for the aggressive alarm management strategy and panel B is the comparison for the conservative alarm management strategy.
\pagebreak
targets::tar_read(roc_plot)
targets::tar_read(tpp)
\pagebreak
targets::tar_read(cal_plot)
\pagebreak
targets::tar_read(dca_plot_combined)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.