data_description.md

Data description for programmers/statisticians

This is a description of the measured values, intended for people analyzing the data. Medical knowledge is not assumed, and the medical part is kept shallow.

Currently this document describes the fake dataset generated by fake_data_grein, which contains fewer markers and comorbidities than the actual data will. The overall structure should stay the same, however. The document is based on consultations with clinicians but has not yet been checked by a clinician; all mistakes are my own.

Basics

The data include only hospitalized patients. We gather two kinds of data: patient data, which correspond to the state of the patient upon admission to hospital plus some summary values, and disease progression data, which are measured repeatedly over the course of the hospitalization. The dataset is represented by a list that contains one element for each type of data.

The primary patient-centric measurements are the final outcome (discharged, deceased or continued hospitalization) and the breathing support the patient requires. The breathing support can be one of:

  1. AA (Ambient air, no support required)
  2. Oxygen (supplemental oxygen by a nasal tube or a light mask)
  3. NIPPV (Non-invasive positive pressure ventilation)
  4. MV (Mechanical, invasive ventilation)
  5. ECMO (Extracorporeal membrane oxygenation - the patient’s blood is oxygenated outside of their body).

These levels are strictly ordered by increasing severity.
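Because the levels are strictly ordered, they can be compared and summarised (for example "worst support ever needed"). The following is a minimal Python sketch of that idea, not part of the package: the class name `BreathingSupport`, the helper `worst_support` and the upper-cased label spelling are assumptions made for illustration only.

```python
from enum import IntEnum


class BreathingSupport(IntEnum):
    """Breathing support levels, numbered by increasing severity."""
    AA = 1      # ambient air, no support required
    OXYGEN = 2  # supplemental oxygen
    NIPPV = 3   # non-invasive positive pressure ventilation
    MV = 4      # mechanical, invasive ventilation
    ECMO = 5    # extracorporeal membrane oxygenation


def worst_support(labels):
    """Return the most severe level among an iterable of label strings.

    Labels are matched case-insensitively against the enum member names
    (an assumption about how the labels are spelled in the data).
    """
    return max(BreathingSupport[label.upper()] for label in labels)
```

For a patient whose daily levels were AA, Oxygen, MV and then Oxygen again, `worst_support` would return `BreathingSupport.MV`.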

Patient data

Patient data is stored in the patient_data list element.

Basic quantities:

Comorbidities:

Derived quantities

These quantities are derived from the disease progression data and might be useful in analysis:

Disease progression data

The most important part of the disease progression data is the breathing data, which contain the breathing support used on each day. These data should have no gaps and should cover the whole hospitalization period. Breathing data are stored in the breathing_data list element. The columns are:

Note that day can in some cases be negative, when some data are available from before hospitalization (this would almost certainly be only PCR test results).
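The "no gaps" requirement on the breathing data can be checked mechanically. The sketch below is illustrative only and assumes the breathing data have been exported to a list of records; the key `patient_id` is a hypothetical name, and `day` is assumed to be an integer as described above. It only detects missing days between a patient's first and last recorded day; verifying that the whole hospitalization is covered would additionally need the admission and discharge days.

```python
from collections import defaultdict


def find_missing_days(rows):
    """Map each patient to the sorted list of days missing from their record.

    Only patients with at least one gap between their first and last
    recorded day appear in the result. Negative days are handled like
    any other integer day.
    """
    days_by_patient = defaultdict(set)
    for row in rows:
        days_by_patient[row["patient_id"]].add(row["day"])

    missing = {}
    for patient, days in days_by_patient.items():
        expected = set(range(min(days), max(days) + 1))
        gaps = sorted(expected - days)
        if gaps:
            missing[patient] = gaps
    return missing
```

An empty result means every patient has one record for each day in their observed range.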

Finally, we collect a number of clinical markers, the most important of which are the drugs the patient used. These are available in both long and wide formats (as marker_data and marker_data_wide). Markers are not measured every day and can be systematically missing for a whole site. The measurement frequency can differ between markers.

In the long format, the columns are:

In the wide format there is a column for each marker and, for those that can be censored, an additional xx_censored column.
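The relationship between the two formats can be sketched as a pivot. The Python code below is an illustration under assumptions, not the package's own conversion: the long-format keys `patient_id`, `marker`, `value` and `censored`, the function name `long_to_wide` and the `censorable` argument are all hypothetical, and the content of the censoring column is passed through unchanged.

```python
def long_to_wide(rows, censorable=()):
    """Pivot long-format marker records into one record per patient and day.

    Each marker becomes its own key; markers listed in `censorable`
    also get a `<marker>_censored` key. Markers that were not measured
    on a given day are simply absent from that day's record.
    """
    wide = {}
    for row in rows:
        key = (row["patient_id"], row["day"])
        record = wide.setdefault(
            key, {"patient_id": row["patient_id"], "day": row["day"]}
        )
        record[row["marker"]] = row["value"]
        if row["marker"] in censorable:
            record[row["marker"] + "_censored"] = row["censored"]
    # Return records ordered by patient and then by day.
    return [wide[key] for key in sorted(wide)]
```

Going the other way (wide to long) would drop the absent markers, which is consistent with missing values meaning "not measured".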

The markers are:

For markers, missing values indicate that the marker was not measured on that day.

The drugs are:

For drugs, the values indicate the dose. Missing values indicate that the patient did not take the drug on the given day.

It probably doesn't make much sense to distinguish different dosing regimens of the drugs (there won't be enough data). Also, the effect of a drug should last beyond the days on which it was taken. This is especially true for HCQ, which is removed from the body only very slowly and can stay at therapeutic concentrations for quite a long time after the patient has stopped taking it. For this reason it probably makes sense to distinguish only "days before taking the drug" from "days after taking the drug for the first time".
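The before/after split can be derived from the daily doses. The following Python sketch is one possible reading of it, under stated assumptions: the function name `ever_taken_by_day` is hypothetical, doses are given as a mapping from day to dose with `None` (or an absent day) meaning the drug was not taken, and the day of the first dose is counted as "after", which the text above leaves open.

```python
def ever_taken_by_day(days, dose_by_day):
    """Flag each day as on/after the first dose (True) or before it (False).

    `days` is an iterable of integer days for one patient and
    `dose_by_day` maps a day to the dose taken, with None meaning
    the drug was not taken. The dose size itself is ignored.
    """
    days = list(days)
    taken_days = [day for day, dose in dose_by_day.items() if dose is not None]
    if not taken_days:
        # The patient never took the drug: every day counts as "before".
        return {day: False for day in days}
    first_day = min(taken_days)
    return {day: day >= first_day for day in days}
```

For a patient observed on days 0 to 4 with a first dose on day 2, days 0 and 1 are flagged `False` and days 2 to 4 are flagged `True`, regardless of whether the drug was taken again later.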



cas-bioinf/covid19retrospective documentation built on Sept. 7, 2021, 6:19 p.m.