Ludvig Wärnberg Gerdin^1,*^, Alan Hubbard^2^, Anurag Mishra^3^, Catherine Juillard^4^, Debojit Basak^5^, Deepa Kizhakke Veetil^6^, Greeshma Abraham^5^, Jyoti Kamble^5^, Kapil Dev Soni^7^, Makhan Lal Saha^8^, Monty Khajanchi^9^, Nitin Borle^10^, Vineet Kumar^11^, Sara Moore^2^, Martin Gerdin Wärnberg^12*,**^
1 Department of Industrial Economics and Management, KTH Royal Institute of Technology, Stockholm, Sweden
2 Division of Epidemiology and Biostatistics, School of Public Health, University of California, Berkeley, California, USA
3 Department of Surgery, Maulana Azad Medical College, New Delhi, Delhi, India
4 Department of Surgery, Center for Global Surgical Studies, University of California, San Francisco, California, USA
5 Tata Institute of Social Sciences and Doctors for You, Mumbai, Maharashtra, India
6 Department of General Surgery, Manipal hospital, Human Care Medical Charitable Trust, New Delhi, Delhi, India
7 Critical & Intensive Care, JPN Apex Trauma Centre, All India Institute of Medical Sciences, New Delhi, Delhi, India
8 Department of Surgery, Institute of Post-Graduate Medical Education and Research and Seth Sukhlal Karnani Memorial Hospital, Kolkata, West Bengal, India
9 Department of Surgery, Seth Gowardhandas Sunderdas Medical College and King Edward Memorial Hospital, Mumbai, Maharashtra, India
10 Department of Surgery, Khershedji Behramji Bhabha hospital, Mumbai, Maharashtra, India
11 Department of Surgery, Lokmanya Tilak Municipal General Hospital, Mumbai, Maharashtra, India
12 Global Health: Health Systems and Policy, Department of Global Public Health Sciences, Karolinska Institutet, Stockholm, Sweden
^*^ These authors contributed equally to this work.
^**^ Senior authorship.
Current Address: Martin Gerdin Wärnberg, Department of Public Health Sciences, Karolinska Institutet, 171 77 Stockholm, Sweden
martin.gerdin\@ki.se
A key component of trauma care is the process of prioritizing patients to match level of care with clinical acuity. In many emergency departments in hospitals in low resource settings, trauma patients arrive with little or no prenotification. In such settings patients are often prioritised by clinicians based on the patients' presentation. We aimed to compare the performance of an ensemble machine learning methodology called SuperLearner to that of clinician gestalt based on patients' presentation.
Our hypothesis was that the performance of the SuperLearner would be
non-inferior to that of clinician gestalt in terms of
classification. We used data from an ongoing prospective cohort study
in three public hospitals in urban India. Adult patients presenting to
the emergency departments of these hospitals with history of trauma
were approached for enrolment. The outcome was all cause mortality
within 30 days of arrival to a participating centre. For the purpose
of this study, clinicians were instructed to assign patients to one of
four levels corresponding to clinical acuity. The SuperLearner
included five machine learners and was developed in a training sample
and then compared to clinicians in a test sample. Performance was
compared in terms of reclassification and area under the receiver
operating characteristics curve (AUROCC). We concluded that the
SuperLearner was non-inferior to clinicians if the lower bound of the
95% confidence intervals (CI) of the net reclassification in events
was not less than -0.05. From 28 July 2016 to 21 November 2017 we approached a total of `r results$n.enrolled` patients for enrolment. Out of these, `r results$n.consent` patients consented, had a priority level assigned by a clinician, and had complete outcome data, and were therefore included in subsequent analysis. A total of r patients died within 30 days. We used a temporal split to divide the cohort into a training and test sample. The training sample included `r nrow(results$s30d.results$samples$train)` patients and the test sample `r nrow(results$s30d.results$samples$test)` patients. The AUROCCs of the priority levels assigned by the SuperLearner and clinicians were `r results$estimates.s30d["cut.model.test"]` and `r results$estimates.s30d["tc.test"]`, respectively. The difference in AUROCC was `r results$estimates.s30d["cut.model.test.diff"]`. The net reclassification in events was `r results$estimates.s30d["nri.plus.test"]` and in non-events `r results$estimates.s30d["nri.minus.test"]`.
In terms of classification and discrimination, an ensemble machine learning model developed using the SuperLearner was non-inferior to clinician gestalt based on patients' presentation in prioritising among adult trauma patients in the ED.
Trauma kills almost five million people each year. A majority of these deaths occur in low resource settings. New methods are needed to prioritise among trauma patients in the emergency department and quickly identify patients in need of immediate care. Machine learning could potentially help to do so, but so far its uptake in trauma research has been slow. We aimed to compare the performance of an ensemble machine learning methodology called SuperLearner to that of clinician gestalt based on patients' presentation.
We analysed data from `r results$descriptive.statistics$n.has.triage.category` adult trauma patients who presented to emergency departments at three public hospitals in urban India. Out of these, `r Report("All cause 30-day mortality", "Yes", "s30d")` patients died from any cause within 30 days of arrival to a participating hospital. We used the SuperLearner to combine multiple machine learners to assign priority levels to included trauma patients based on demographic and clinical patient characteristics. We asked clinicians to also assign priority levels to the same patients and compared the performance of the SuperLearner and clinicians. We found that the SuperLearner was non-inferior to clinicians.
Using an ensemble machine learning algorithm to prioritise among trauma patients in the ED may allow clinicians to focus on treating patients. This would free valuable resources that are particularly scarce in the low resource settings where most trauma deaths occur.
Trauma is a major threat to population health globally [@Brohi2017; @GBD2017]. Every year about 4.6 million people die because of trauma - more than the deaths from HIV/AIDS, malaria and tuberculosis combined. This situation calls for not only more interventions, but also strengthened research on effective trauma care delivery.
Trauma care is highly time sensitive and delays to treatment have been associated with increased mortality across settings [@Yeboah2014; @OReilly2013; @Roy2017a]. Early identification and management of potentially life threatening injuries are crucial. Trauma triage - the process of prioritizing patients to match level of care with clinical acuity - is a key component of trauma care [@EAST2010; @NICE2016].
In health systems with formalised criteria for emergency department trauma triage, all patients are assigned a priority coupled with a target time to treat. These priorities may be coded with numbers [@ESI2012] or colors [@SATG2012], for example red, orange, yellow and green, with red being assigned to the most urgent patients and green to the least urgent.
In health systems without formalised criteria, for example in many low resource settings, clinician gestalt is used to informally triage trauma patients in the emergency department [@Baker2013]. Where prehospital care is lacking, patients often arrive to the emergency department without warning [@Choi2017]. Identifying ways to quickly triage patients would therefore be valuable in such settings.
The approach to triaging trauma patients arriving to the emergency department has received little attention from the research community. Framed as a classification problem, this challenge can be addressed using a statistical learner. Logistic or proportional hazards models are common classification learners, whereas more modern alternatives include random forests and neural networks.
The uptake and use of such learners in trauma research has been slow [@Liu2017]. One recent study used a random forest learner to assign priority to patients in a general emergency department population, and found a slight performance improvement using this learner compared with the standard criteria [@Levin2018].
Given the paucity of research leveraging machine learning to triage trauma patients in the emergency department, we aimed to compare the performance of an ensemble machine learning model to that of clinician gestalt based on patients' presentation. Our hypothesis was that the performance of this ensemble model would be non-inferior to that of clinician gestalt.
We used data from the ongoing Trauma Triage Study (TTRIS) in India, a Towards Improved Trauma Care Outcomes (TITCO) study. This study is a prospective cohort study in three public hospitals in urban India.
Data analysed for this study came from patients enrolled between 28 July 2016 and 21 November 2017 at the three hospitals Khershedji Behramji Bhabha hospital (KBBH) in Mumbai, Lok Nayak Hospital of Maulana Azad Medical College (MAMC) in Delhi, and the Institute of Post-Graduate Medical Education and Research and Seth Sukhlal Karnani Memorial Hospital (SSKM) in Kolkata. The time frame was decided to ensure that all included patients had completed six months follow up.
KBBH is a community hospital with 436 inpatient beds. There are departments of surgery, orthopaedics, anaesthesia, and both adult and paediatric intensive care units. It has a general ED where all patients are seen. Most patients present directly and are not transferred from another health centre. Plain X-rays and ultrasonography are available around the clock but computed tomography (CT) is only available in-house during day-time. During evenings and nights patients in need of a CT are referred elsewhere.
MAMC and SSKM are both university and tertiary referral hospitals. This means that all specialities and imaging facilities relevant to trauma care, except emergency medicine, are available in-house around the clock. MAMC has approximately 2200 inpatient beds and SSKM has around 1775 inpatient beds. Both MAMC and SSKM have general emergency departments. Because both MAMC and SSKM are tertiary referral hospitals a large proportion of patients arriving at their EDs are transferred from other health facilities, with almost no transfer protocols in place.
Prehospital care is rudimentary in all three cities, with no organised emergency medical services. Ambulances are predominately used for inter-hospital transfers and most patients who arrive directly from the scene of the incident are brought by the police or in private vehicles.
Patients arriving to the emergency department are at all centres first seen by a casualty medical officer on a largely first come first served basis. There is no formalised system for prioritising emergency department patients at any of the centres.
The research was approved by the ethical review board at each participating hospital. The names of the boards and the approval numbers were Ethics and Scientific Committee (KBBH, HO/4982/KBB), the Institutional Ethics Committee (MAMC, F.1/IEC/MAMC/53/2/2016/No97), and the IPGME&R Research Oversight Committee (SSKM, Inst/IEC/2016/328).
Data were collected by one dedicated project officer at each site. The project officers all had a master's degree in life sciences. They worked five shifts per week, each about eight hours long, so that mornings, evenings and nights were covered according to a rotating schedule. In each shift, project officers spent approximately six hours collecting data in the emergency department and the remaining two following up patients. The collected data were then transferred to a digital database. The rationale for this setup was to ensure collection of high-quality data from a representative sample of trauma patients arriving to the emergency departments at participating centres, while keeping to the project's budget constraints.
Any person aged 18 years or older who presented alive to the emergency department of a participating site with a history of trauma was included. The age cutoff was chosen to align with Indian laws on research ethics and informed consent. We defined history of trauma as having any of the external causes of morbidity and mortality listed in block V01-Y36, chapter XX of the International Classification of Diseases version 10 (ICD-10) as the primary complaint. Drownings, inhalation and ingestion of objects causing obstruction of the respiratory tract, contact with venomous snakes and lizards, accidental poisoning by and exposure to drugs, and overexertion were excluded because they are not considered trauma at the participating centres.
The project officers enrolled the first ten consecutive patients who presented to the emergency department during each shift. The number of patients to enrol was set to ten to make follow up feasible. Written informed consent was obtained from the patient or a patient representative, either in the emergency department or in the ward if the patient was admitted. Follow-up was completed by the project officer 30 days and 6 months after the participant arrived at the participating hospital, either in person or by phone, depending on whether the patient was still hospitalised or had been discharged. Phone numbers of one or more contact persons (e.g. relatives) were collected on enrolment, and these persons were contacted if the participant did not answer at follow-up. Only if neither the participant nor the contact persons answered any of three repeated phone calls was the outcome recorded as missing and the patient considered lost to follow up.
The ensemble model was trained on two target variables. The first target was all-cause 30-day mortality, defined as death from any cause within 30 days of arrival to a participating centre. These data were extracted from patient records if the patient was still in hospital 30 days after arrival, or collected by calling the patient or the patient representative if the patient was not in hospital. The second target was a composite outcome of early mortality, ICU admission, major urgent surgery, and severe injury. Early mortality was defined as death within 24 hours of arrival to the participating hospitals. ICU admission was defined as admission to the ICU within 48 hours of arrival. While most ICU admissions occur within hours of hospital arrival, we extended the time frame to compensate for bed availability and transfer delays. Major urgent surgery was defined as a major surgery performed within 24 hours of arrival to the participating hospitals. Surgical excerpts were reworked into standardised nomenclature using the Nordic Medico-Statistical Committee (NOMESCO) Classification of Surgical Procedures and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). In the absence of a general definition of major surgery, a team of experienced surgeons and researchers decided which surgeries to consider major. Severe injury was defined as an Injury Severity Score (ISS) over 15. This cutoff for severe injury is traditionally used in trauma research since it is said to indicate a 10% mortality [@Boyd1987].
The features included patient age in years, sex, mechanism of injury, type of injury, mode of transport, transfer status, and time from injury to arrival in hours. The project officers collected data on these features by asking the patient or a patient representative, or by extracting the data from the patient's file. Sex was coded as male or female. Mechanism of injury was coded by the project officers using ICD-10 after completing the World Health Organization's (WHO) electronic ICD-10 training tool [@WHOICD]. The levels of mechanism of injury were collapsed for analysis into transport accident (codes V00-V99), falls (W00-W19), burns (X00-X19), intentional self harm (X60-X84), assault (X85-X99 and Y00-Y09), and other mechanism (W20-W99, X20-X59 and Y10-Y36). Type of injury was coded as blunt, penetrating, or both blunt and penetrating. Mode of transport was coded as ambulance, police, private vehicle, or arrived walking. Transfer status was a binary feature indicating whether the patient was transferred from another health facility.
The features also included vital signs measured on arrival to the ED at participating centres. The project officers, after receiving two days of training and yearly refreshers, recorded all vital signs using handheld equipment rather than extracting them from patient records. Only if the handheld equipment failed to record a value did the project officers extract data from other attached monitoring equipment, if available. Systolic and diastolic blood pressure (SBP and DBP) were measured using an automatic blood pressure monitor. Heart rate (HR) and peripheral capillary oxygen saturation (SpO~2~) were measured using a portable non-invasive fingertip pulse oximeter. Respiratory rate (RR) was measured manually by counting the number of breaths during one minute. Level of consciousness was measured using both the Glasgow coma scale (GCS) and the Alert, Voice, Pain, and Unresponsive scale (AVPU). In assigning GCS the project officers used the official Glasgow Coma Scale Assessment Aid [@GCSAID]. AVPU simply indicates whether the patient is alert, responds to voice stimuli, responds to painful stimuli, or does not respond at all.
These represent standard variables commonly collected in many health systems. They are also included in several well known clinical prediction models designed to predict trauma mortality [@Rehn2011].
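The collapsing of ICD-10 external-cause codes into the analysis levels described above can be sketched in R. This is a minimal illustration only; the helper name and code handling are ours, not the study's actual implementation.

```r
## Hypothetical helper: collapse a three-character ICD-10 external-cause
## code into the analysis levels listed in the Methods.
collapse_mechanism <- function(icd) {
  letter <- substr(icd, 1, 1)
  num <- as.integer(substr(icd, 2, 3))
  ifelse(letter == "V", "Transport accident",
  ifelse(letter == "W" & num <= 19, "Fall",
  ifelse(letter == "X" & num <= 19, "Burn",
  ifelse(letter == "X" & num >= 60 & num <= 84, "Intentional self harm",
  ifelse((letter == "X" & num >= 85) | (letter == "Y" & num <= 9),
         "Assault",
         "Other mechanism")))))
}

collapse_mechanism(c("V43", "W10", "X09", "X70", "Y04", "W25"))
## "Transport accident" "Fall" "Burn" "Intentional self harm"
## "Assault" "Other mechanism"
```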
For the purpose of this study, clinicians were instructed by the project officers to assign a priority level to each patient. The priority levels were color coded. Red was assigned to the most serious patients, who should be treated first, and green to the least serious patients, who should be treated last. Orange and yellow were intermediate levels: orange patients were less serious than red but more serious than yellow and green, and yellow patients were less serious than red and orange but more serious than green. The clinicians were allowed to use all information available at the time they assigned the priority level, which was as soon as they had first seen the patient. The priorities were not used to guide further patient care and no interventions were implemented as part of the study for patients assigned to the more urgent priority levels.
Project officers underwent two days of training in study procedures and were then supervised locally. We conducted continuous data quality assurance by having weekly online data review meetings during which data discrepancies were identified, discussed and resolved. We conducted quarterly on site quality control sessions during which data collection was conducted both by the centre's own project officer and a quality control officer. Data entry errors were prevented by having extensive logical checks in the digital data collection instrument.
All data were de-identified before analysis. Details of the de-identification procedures are available as supporting information. We used R for all analyses [@R]. We first made a non-random temporal split of the complete data set into a training and test set. The split was made so that 75% of the complete cohort was assigned to the training set and the remaining 25% to the test set, ensuring that the relative contribution of each centre was maintained in both sets. We then calculated descriptive statistics for all variables, using medians and interquartile ranges (IQR) for continuous variables and counts and percentages for qualitative variables. All quantitative features (age, SBP, DBP, HR, SpO~2~, and RR) were treated as continuous and the levels of all qualitative variables (sex, mechanism of injury, type of injury, mode of transport, transfer status, and GCS components) were treated as bins (dummy variables).
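A centre-stratified temporal split of this kind can be sketched as follows. The column names `centre` and `doar` (date of arrival) are hypothetical stand-ins, not the study's actual variable names.

```r
temporal_split <- function(data, prop = 0.75) {
  ## Within each centre, order patients by arrival time and assign the
  ## earliest `prop` of them to the training set and the rest to the
  ## test set, so each centre's relative contribution is preserved.
  split_one <- function(d) {
    d <- d[order(d$doar), ]
    n_train <- ceiling(prop * nrow(d))
    d$set <- rep(c("train", "test"), c(n_train, nrow(d) - n_train))
    d
  }
  out <- do.call(rbind, lapply(split(data, data$centre), split_one))
  rownames(out) <- NULL
  out
}
```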
The study sample was split into three parts, henceforth referred to as the training, validation, and test sets. We then developed our ensemble model in the training and validation sets using the SuperLearner R package [@SuperLearner]. SuperLearner is an ensemble machine learning algorithm, meaning that it combines the predictions of several learners to come up with an "optimal" learner. Table \@ref(tab:superlearner-library) shows our library of learners. All were implemented using the default hyperparameters. Short descriptions of the individual learners are available as supporting information.
The ensemble model was trained using ten-fold cross validation. This procedure is implemented by default in the SuperLearner package and entails splitting the development data into ten mutually exclusive parts of approximately the same size. All learners included in the library are then fitted using the combined data of nine of these parts and evaluated in the tenth. This procedure is repeated ten times, i.e. each part is used once as the evaluation data, and is intended to limit overfitting and reduce optimism.
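Assuming simulated stand-in data (the study's actual features are those listed above), the fitting step might be sketched as:

```r
library(SuperLearner)

set.seed(123)
## Simulated stand-in features and binary outcome, for illustration only.
X <- data.frame(age = rnorm(500, 35, 15), sbp = rnorm(500, 120, 25))
y <- rbinom(500, 1, plogis(-3 + 0.03 * X$age - 0.01 * X$sbp))

fit <- SuperLearner(
  Y = y, X = X,
  family = binomial(),
  SL.library = c("SL.randomForest", "SL.xgboost", "SL.glm",
                 "SL.gam", "SL.glmnet"),
  cvControl = list(V = 10)  # ten-fold cross validation, as described
)

## Continuous ensemble predictions (here on the same data, for brevity).
preds <- predict(fit, newdata = X)$pred
```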
The ensemble model predictions were then used to assign levels of priority to patients. This was done by binning the continuous ensemble model prediction into four bins using cutoffs identified with a grid search that optimised the area under the receiver operating characteristic curve (AUROCC) across all possible combinations of unique cutoffs, where each cutoff could take any value from 0.01 to 0.99 in 0.01 unit increments. These bins corresponded to the green, yellow, orange, and red priority levels assigned by the clinicians. The cutoffs were identified in the validation set to prevent information leakage and limit bias. The performance of both the continuous ensemble model prediction and the ensemble model priority levels was then evaluated by estimating their AUROCC. We also visualised the performance by plotting ROC curves.
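A simplified version of this grid search is sketched below, using a rank-based (Mann-Whitney) AUROCC estimate; the function names are ours and the study's implementation may differ.

```r
auc <- function(score, y) {
  ## Rank-based (Mann-Whitney) estimate of the AUROCC.
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

grid_search_cutoffs <- function(pred, y, grid = seq(0.01, 0.99, 0.01)) {
  ## Evaluate every combination of three increasing cutoffs and keep the
  ## combination whose binned prediction maximises the AUROCC.
  cuts <- t(combn(grid, 3))
  best <- list(auc = -Inf, cuts = NULL)
  for (i in seq_len(nrow(cuts))) {
    bins <- findInterval(pred, cuts[i, ])  # 0-3 maps to green-red
    a <- auc(bins, y)
    if (a > best$auc) best <- list(auc = a, cuts = cuts[i, ])
  }
  best
}
```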
```{r superlearner-library}
tibble::tribble(
  ~Learner, ~"R package", ~"SuperLearner function",
  "Breiman's random forest algorithm", "randomForest [@randomforest]", "SL.randomForest",
  "Extreme Gradient Boosting machine", "XGboost [@xgboost]", "SL.xgboost",
  "Generalized Linear Model", "glm (built-in)", "SL.glm",
  "Generalized Additive Model", "gam [@gam]", "SL.gam",
  "Penalized regression model using elastic net", "glmnet [@glmnet]", "SL.glmnet"
) %>%
  knitr::kable(format = "latex", booktabs = TRUE,
               caption = "Algorithms combined to SuperLearner.")
```
We then used the ensemble model to predict the outcomes of the patients in the test set and used the cutoff values from the validation set to assign a level of priority to each patient in this set. The performance of the continuous ensemble prediction, the ensemble model priority levels, and the clinicians' priority levels, was then evaluated by estimating and comparing their AUROCC.
The levels of priority assigned by the ensemble model and the clinicians were then compared by estimating the net reclassification in events (patients with the outcome, i.e. patients who died within 30 days of arrival) and non-events (patients without the outcome). The net reclassification in events was defined as the difference between the proportion of events assigned a higher priority by the ensemble model than by the clinicians and the proportion of events assigned a lower priority by the ensemble model than by the clinicians. Conversely, the net reclassification in non-events was defined as the difference between the proportion of non-events assigned a lower priority by the ensemble model than by the clinicians and the proportion of non-events assigned a higher priority by the ensemble model than by the clinicians.
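These two definitions translate directly into a few lines of R (a sketch with an assumed integer coding of the priority levels):

```r
net_reclassification <- function(model_level, clinician_level, y) {
  ## Priority levels are assumed coded as ordered integers, e.g.
  ## 1 = green, 2 = yellow, 3 = orange, 4 = red.
  up <- model_level > clinician_level
  down <- model_level < clinician_level
  list(
    ## Events: proportion moved up minus proportion moved down.
    events = mean(up[y == 1]) - mean(down[y == 1]),
    ## Non-events: proportion moved down minus proportion moved up.
    non_events = mean(down[y == 0]) - mean(up[y == 0])
  )
}
```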
We used an empirical bootstrap with 1000 draws of the same size as the original set to estimate 95% confidence intervals (CI) around differences. We concluded that the SuperLearner was non-inferior to the clinicians if the lower bound of the 95% CI of the net reclassification in events was not less than the pre-specified margin of -0.05, a margin corresponding to clinicians correctly classifying 5 in 100 events more than the ensemble model.
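An empirical bootstrap CI of this kind can be sketched generically as below (function and argument names are ours):

```r
boot_ci <- function(data, statistic, n_boot = 1000, level = 0.95) {
  ## Empirical bootstrap: resample rows with replacement, take quantiles
  ## of the centred bootstrap statistics, and reflect them around the
  ## point estimate.
  theta_hat <- statistic(data)
  deltas <- replicate(n_boot, {
    statistic(data[sample(nrow(data), replace = TRUE), , drop = FALSE]) -
      theta_hat
  })
  alpha <- 1 - level
  ci <- theta_hat - quantile(deltas, c(1 - alpha / 2, alpha / 2))
  setNames(ci, c("lower", "upper"))
}
```

The statistic passed in could be, for example, the net reclassification in events computed on a resampled test set.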
Observations with missing data on all cause 30-day mortality or on the priority level assigned by clinicians were excluded. Missing data in features were treated as informative. For each feature with missing data we created a non-missingness indicator, a variable that took the value 0 if the feature value was missing and 1 otherwise. Missing feature values were then replaced with the median of the observed data for quantitative features and the most common level for qualitative features. We included the non-missingness indicators as features in the ensemble model.
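This handling of missing features can be sketched as follows (the helper and the `.observed` suffix are our own naming, not the study's):

```r
add_missingness_indicators <- function(data) {
  for (v in names(data)) {
    if (!anyNA(data[[v]])) next
    ## Non-missingness indicator: 1 if observed, 0 if missing.
    data[[paste0(v, ".observed")]] <- as.integer(!is.na(data[[v]]))
    fill <- if (is.numeric(data[[v]])) {
      median(data[[v]], na.rm = TRUE)  # median of observed values
    } else {
      names(which.max(table(data[[v]])))  # most common observed level
    }
    data[[v]][is.na(data[[v]])] <- fill
  }
  data
}
```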
During the study period, we approached a total of `r sprintf("%d", round(as.integer(results$s30d.descriptive.statistics$n.enrolled)*1.01))` patients for enrolment. A random sample of `r round(as.integer(results$s30d.descriptive.statistics$n.enrolled)*0.01)` observations was removed during data de-identification. Consent was declined by `r results$s30d.descriptive.statistics$n.not.consent` patients. Out of the `r results$s30d.descriptive.statistics$n.consent` patients who provided informed consent, `r results$s30d.descriptive.statistics$n.not.adults` were not adults, and `r results$s30d.descriptive.statistics$n.not.has.triage.category` had missing information on triage category, leaving `r results$s30d.descriptive.statistics$n.has.triage.category` patients. For 30-day mortality, `r results$s30d.descriptive.statistics$n.not.complete.outcome` patients had missing outcome data, whereas for the composite outcome `r results$composite.descriptive.statistics$n.not.complete.outcome` had missing data. Thus, the final samples included `r results$s30d.descriptive.statistics$n.complete.outcome` and `r results$composite.descriptive.statistics$n.complete.outcome` patients for the 30-day mortality and composite outcomes, respectively.
```{r}
knitr::include_graphics("./figures/flowchart.manual.pdf")
```
Tables \@ref(tab:s30d-sample-characteristics) and \@ref(tab:composite-sample-characteristics) show the characteristics of our samples. A total of `r Report("n", strata = "Missing values")` patients had missing values in at least one feature. Among the included patients the median age was `r Report(variable="Age in years", data.variable="age")` years. A majority, `r Report("Sex", "Male", "sex")` patients, were male. The most common mechanism of injury was transport accidents, accounting for `r Report("Mechanism of injury", "Transportation accident", "moi")` patients. A total of `r Report("Mode of transport", "Private vehicle", "mot")` patients were transported to participating centres in some sort of private vehicle, such as a car, taxi, or rickshaw. A majority of patients had normal vital signs on arrival to participating centres. Out of all patients, `r Report("All cause 30-day mortality", "Yes", "s30d")` died within 30 days of arrival. The training, validation, and test samples included `r nrow(results$s30d.results$samples$train)`, `r nrow(results$s30d.results$samples$validation)`, and `r nrow(results$s30d.results$samples$test)` patients, respectively.
```{r s30d-sample-characteristics}
kableExtra::kbl(results$s30d.formatted.sample.characteristics,
                format = "latex", booktabs = TRUE,
                caption = "Sample characteristics for 30-day mortality") %>%
  kableExtra::kable_styling(latex_options = "scale_down") %>%
  kableExtra::footnote(general = "Abbreviations and explanations: AVPU, Alert, voice, pain, unresponsive scale; DBP, Diastolic blood pressure in mmHg; Delay, Time between injury and arrival to participating centre in minutes; EGCS, Eye component of the Glasgow Coma Scale; HR, Heart rate; MGCS, Motor component of the Glasgow Coma Scale; RR, Respiratory rate in breaths per minute; SBP, Systolic blood pressure in mmHg; SpO\textsubscript{2}, Peripheral capillary oxygen saturation; Transferred, Transferred from another health facility; VGCS, Verbal component of the Glasgow Coma Scale",
                       threeparttable = TRUE)
```
```{r composite-sample-characteristics}
kableExtra::kbl(results$composite.formatted.sample.characteristics,
                format = "latex", booktabs = TRUE,
                caption = "Sample characteristics for composite outcome") %>%
  kableExtra::kable_styling(latex_options = "scale_down") %>%
  kableExtra::footnote(general = "Abbreviations and explanations: AVPU, Alert, voice, pain, unresponsive scale; DBP, Diastolic blood pressure in mmHg; Delay, Time between injury and arrival to participating centre in minutes; EGCS, Eye component of the Glasgow Coma Scale; HR, Heart rate; MGCS, Motor component of the Glasgow Coma Scale; RR, Respiratory rate in breaths per minute; SBP, Systolic blood pressure in mmHg; SpO\textsubscript{2}, Peripheral capillary oxygen saturation; Transferred, Transferred from another health facility; VGCS, Verbal component of the Glasgow Coma Scale",
                       threeparttable = TRUE)
```
The AUROCC of the continuous ensemble model prediction in the training sample was `r results$estimates.s30d["con.model.train"]` (Fig \@ref(fig:roc-plot)A), whereas the AUROCC in the validation sample was `r results$estimates.s30d["con.model.validation"]`. The cutpoints identified by the grid search in the validation sample were `r results$s30d.optimal.breaks[1]`, `r results$s30d.optimal.breaks[2]`, and `r results$s30d.optimal.breaks[3]`. We used these cutpoints to bin the continuous ensemble model prediction into the four priority levels green, yellow, orange, and red. The AUROCCs of the ensemble model priority levels in the training and validation samples were `r results$estimates.s30d["cut.model.train"]` and `r results$estimates.s30d["cut.model.validation"]`, respectively.
```{r roc-plot}
results$roc.plot.s30d
```
We then applied the ensemble model to the test sample. The AUROCC of the continuous ensemble model prediction was `r results$estimates.s30d["con.model.test"]` (Fig \@ref(fig:roc-plot)B). The performance of each included learner is available as supporting information in \nameref{S3_Fig} and \nameref{S4_Table}. We used the same cutpoints as in the validation sample to bin the continuous predictions into the four priority levels. The AUROCC of the ensemble model priority levels in the test sample was `r results$estimates.s30d["cut.model.test"]`. In the test sample we compared the performance of the binned ensemble model prediction with that of clinicians. The AUROCC of the priority levels assigned by clinicians was `r results$estimates.s30d["tc.test"]`. The difference in AUROCC between the ensemble model and clinician priority levels was `r results$estimates.s30d["cut.model.test.diff"]`. The net reclassification in events and non-events was `r results$estimates.s30d["nri.plus.test"]` and `r results$estimates.s30d["nri.minus.test"]` respectively. The overall reclassification is presented in Table \@ref(tab:composite-reclass-all).
```{r composite-reclass-all}
kableExtra::kbl(results$composite.classification.tables$reclass.all,
                format = "latex", booktabs = TRUE,
                caption = "Reclassification measures for composite outcome") %>%
  kableExtra::add_header_above(c(" " = 1, "SuperLearner" = 4, " " = 3)) %>%
  kableExtra::footnote(general = "Reclassification (Rec.) figures refer to % of patients reclassified by the SuperLearner compared to clinicians. Rec. up and Rec. down indicates % of patients reclassified to a higher or lower priority level respectively.",
                       threeparttable = TRUE)
```
Fig \@ref(fig:composite-mortality-plot) shows that the number of patients assigned to each priority level differed substantially between the ensemble model and the clinicians. This difference was particularly marked in the green and yellow priority levels. The ensemble model assigned the green priority level to `r results$composite.triage.comparison.plot.data[1,4]` patients, whereas clinicians assigned this level to `r results$composite.triage.comparison.plot.data[9,4]` patients. Among the patients that the ensemble model prioritised as green, `r results$composite.triage.comparison.plot.data[5,2]` had the composite outcome. The corresponding figure for the clinicians was `r results$composite.triage.comparison.plot.data[13,2]`. In contrast, the ensemble model assigned the yellow priority level to `r results$composite.triage.comparison.plot.data[2,4]` patients, out of which `r results$composite.triage.comparison.plot.data[6,2]` had the composite outcome. Corresponding figures for the clinicians were `r results$composite.triage.comparison.plot.data[10,4]` and `r results$composite.triage.comparison.plot.data[14,2]`.
```{r composite-mortality-plot}
results$triage.comparison.plot.composite
```
Our study suggests that, in terms of classification, an ensemble machine learner developed with the SuperLearner methodology may be non-inferior to clinicians in prioritising among adult trauma patients in the ED. Further, the ensemble learner was superior to clinician gestalt in terms of discrimination. We have not been able to identify any previous study that has applied machine learning to prioritise among trauma patients in the ED. Hence, as far as we know, this is the first study of its kind in this area, and we hope that our results can serve as benchmarks against which future work can be compared.
We found that the ensemble model reclassified non-events to a lower priority level compared to clinicians, as indicated by the net reclassification in non-events. Specifically, the ensemble model reclassified a majority of patients from the yellow to the green priority level. This is analogous to reduced overtriage. Overtriage and undertriage are concepts used extensively in the trauma literature: undertriage refers to, for example, patients with major trauma not being transferred to a trauma centre, and overtriage to patients with minor trauma being transferred to a trauma centre. Our findings indicate that most of the patients assigned to the yellow priority level by clinicians were overtriaged, straining the health system in the face of limited resources. The ensemble model may have the potential to reduce this overtriage substantially.
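As a hypothetical illustration of how over- and undertriage can be quantified from assigned priority levels and observed outcomes (all vectors below are made up):

```r
## Overtriage: high-priority patients who did not have the outcome.
## Undertriage: low-priority patients who did have the outcome.
## Illustrative data only.
high <- c(TRUE, TRUE, FALSE, FALSE, TRUE)    # assigned red or orange
event <- c(TRUE, FALSE, FALSE, TRUE, FALSE)  # composite outcome

overtriage <- mean(!event[high])   # share of high-priority patients without the outcome
undertriage <- mean(event[!high])  # share of low-priority patients with the outcome
```

With these data, `overtriage` is 2/3 and `undertriage` is 0.5; the ideal triage system drives both towards zero, but in practice reducing one tends to increase the other.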
Three studies have used machine learning to limit under- and overtriage of trauma patients. Talbert et al. applied a tree-based learner but found no improvement over standard criteria [@Talbert2007]. More recent research by Follin et al. demonstrated superior performance of a tree-based learner compared to a model based on logistic regression [@Follin2016]. Pearl et al. used neural networks but could not demonstrate a difference [@Pearl2008]. Only Follin et al. report performance measures that can be compared to our results. Their learner achieved an AUROCC of 0.82, which is substantially lower than that of our ensemble learner.
In contrast, the literature is replete with studies using conventional statistical learners to reduce under- and overtriage, or to predict trauma mortality [@Rehn2011; @DeMunter2017; @VanRein2018]. The performance of these learners varies substantially, but many studies report AUROCCs that approach that of our ensemble learner. For example, Miller et al. and Kunitake et al. achieved AUROCCs of almost 0.97 and 0.94 with their models based on logistic regression [@Miller2017]. Neither of these studies, however, addressed the problem of prioritising among trauma patients in the ED, or suggested how the models could be used to assign patients to different priority levels.
Our study was limited by the relatively small sample size. For example, we did not have enough data to run centre-wise analyses, which should be a focus of future studies. Instead we concentrated on data quality and had dedicated project officers record all data. This resulted in very low levels of missing feature data. We did, however, have a considerable amount of missing outcome data, with about 20% of patients lost to follow up. We handled this missingness using listwise deletion, aware of the potential bias introduced by this approach. One alternative would have been to use multiple imputation to replace missing values; however, we had no way of determining the mechanism underlying the missing outcomes, meaning that results based on multiply imputed data might have been biased as well. Further, we did not consider it computationally feasible to combine multiple imputation with bootstrapping for uncertainty estimation. We do, however, consider it a strength of our study that the outcome included out-of-hospital deaths, when comparably recent research does not [@Levin2018; @Kunitake2018].
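Listwise deletion on the outcome amounts to a complete-case subset, which can be sketched as follows; the data frame and column names are illustrative, not the study dataset:

```r
## Listwise deletion: keep only patients with an observed outcome.
## Illustrative data frame; NA marks patients lost to follow up.
df <- data.frame(
  outcome = c(1, NA, 0, NA, 1),
  sbp = c(120, 90, 110, 100, 80)
)

analysis.data <- df[!is.na(df$outcome), ]       # complete-case subset
prop.missing <- mean(is.na(df$outcome))          # share lost to follow up
nrow(analysis.data)  # 3 patients retained
```

The analysis is unbiased only if the retained patients are representative of the full cohort, i.e. if outcomes are missing completely at random, which could not be verified here.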
We used point measurements to train the ensemble model, meaning that we could not account for potential changes in patients' clinical condition between the time when feature and outcome data were collected. The clinicians were, however, also limited to the data available when they decided on a priority level, although this could have included laboratory or imaging findings from a transferring health facility. Future research may improve the predictions of both the ensemble machine learner and clinicians by including data from multiple time points.
Unlike the clinicians, the ensemble learner was limited to the features that we defined. For example, in our setting with no or very limited electronic record keeping, it would have been challenging to incorporate, for instance, imaging data. In settings with more extensive electronic records this should be more feasible. Further, the ensemble learner was limited by the techniques included in its library. We included a mix of conventional statistical and machine learning techniques, for example logistic regression and random forest. The performance of our ensemble learner was already very good, but extending the list of features and techniques available to the learner would likely improve it further. Also, we used the default hyperparameter settings for each technique. Future research may improve the learner's performance by tuning the included learners' hyperparameters.
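Conceptually, the SuperLearner combines its candidate learners' predictions with weights chosen to minimise cross-validated risk. A minimal base-R sketch of the combination step, with made-up predictions and weights:

```r
## Convex combination of candidate learners' predictions, as produced
## by a SuperLearner. All numbers below are made up for illustration.
p.glm <- c(0.10, 0.80, 0.30)  # hypothetical logistic regression predictions
p.rf  <- c(0.20, 0.60, 0.40)  # hypothetical random forest predictions
weights <- c(0.7, 0.3)        # assumed cross-validation weights (sum to 1)

p.ensemble <- weights[1] * p.glm + weights[2] * p.rf
p.ensemble
# [1] 0.13 0.74 0.33
```

In practice this is handled by the `SuperLearner` package, e.g. `SuperLearner(Y, X, family = binomial(), SL.library = c("SL.glm", "SL.randomForest"))`, which estimates the weights itself.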
Several steps remain before a system to prioritise among adult trauma patients in the ED based on the ensemble model can and should be implemented. These steps involve refining the algorithm, comparing it with other commonly used methods to prioritise patients in the ED, incorporating it into usable software that may be used in parallel even in settings with no electronic health records, and designing an implementation study to assess both its effectiveness and safety.
There are many ways in which the algorithm could be refined. We regard optimising the algorithm to minimise deaths in the green priority level as the most important. Secondly, a sequence in which to measure the variables should be defined. We think that this sequence should be based on a combination of individual variable importance and how feasible the variables are to record. We assume that, once this sequence is defined, the patients with the most severe trauma could be identified very quickly using only a small subset of the variables. Finally, other outcomes should be explored. In a larger dataset, a composite outcome of early deaths, e.g. within 24 hours, and admission to intensive care or acute surgery could be explored.
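One simple way to optimise the green cutpoint along these lines would be a constrained grid search; the sketch below, with entirely hypothetical data and tolerance, picks the largest cutpoint that leaves no events in the green level:

```r
## Hypothetical grid search for the green/yellow cutpoint: maximise the
## number of patients triaged green subject to no events among them.
## All data below are made up for illustration.
predictions <- c(0.01, 0.02, 0.05, 0.10, 0.40)
event <- c(FALSE, FALSE, TRUE, FALSE, TRUE)
candidate.cuts <- c(0.015, 0.03, 0.07)

ok <- sapply(candidate.cuts, function(threshold) {
  green <- predictions <= threshold
  mean(event[green]) == 0  # constraint: no events in the green level
})
best.cut <- max(candidate.cuts[ok])
best.cut
# [1] 0.03
```

A real implementation would replace the zero-event constraint with a clinically acceptable event-rate tolerance and estimate it with cross-validation rather than in-sample.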
In terms of classification and discrimination, an ensemble machine learning algorithm developed using the SuperLearner methodology was non-inferior in prioritising among adult trauma patients in the ED compared to clinician gestalt based on patients' presentation. It is possible that the ensemble model is especially useful for reducing the number of patients who would otherwise be prioritised to an unnecessarily high priority level.
Details of de-identification procedures.
Short descriptions of included learners.
Receiver operating characteristic and precision recall curves of included learners.
Risk, weight and area under receiver operating characteristic curve of included learners.
We would like to thank the Towards Improved Trauma Care Outcomes and the Trauma Triage Study in India teams.