Data Quality Study Protocol

This is an informatics study that focuses on data quality (rather than a clinical question).

Introduction

Data of sufficient quality is an important prerequisite for conducting high-quality research. In a prior study (Achilles Heel Evaluation), we conducted an initial proof-of-concept comparison of the Achilles and Achilles Heel tools across sites. Thanks to that study and additional input on data quality measures, we have implemented new measures and new data quality rules in the latest version of Achilles (version 1.3 and later).

The present study (Data Quality study) aims to (1) evaluate the new rules; (2) test a restricted subset of Achilles outputs that can be compared across sites; and (3) decide thresholds for new rules that investigate data density (for general population datasets as well as specialized population datasets). An additional component is (4) a comparison of temporal trends to detect deviations from reference data.

Methods (for goals 1, 2, and 3)

Each site will be assigned a meaningless identifier, and selected Achilles analyses (or measures) will be compared across sites.

New rules example

For example, Achilles 1.3 contains a rule that looks at the count of distinct units for each measurement_concept_id (rule_id 35). This rule currently has hard-coded thresholds (the rule triggers a notification if 10+ measurements have 5+ units). We hope to study the variability across sites and decide the most appropriate threshold for issuing a WARNING or NOTIFICATION type of Achilles Heel output for this problem.
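
The threshold logic described above can be sketched in a few lines of Python. This is only an illustration of the "10+ measurements with 5+ units" condition, not the Achilles implementation (Achilles evaluates its rules on precomputed results); the function names and the input shape (pairs of measurement_concept_id and unit_concept_id) are assumptions made for the example.

```python
from collections import defaultdict


def distinct_unit_counts(rows):
    """Count distinct unit_concept_ids per measurement_concept_id.

    rows: iterable of (measurement_concept_id, unit_concept_id) pairs
    (a hypothetical input shape chosen for this sketch).
    """
    units = defaultdict(set)
    for measurement_concept_id, unit_concept_id in rows:
        units[measurement_concept_id].add(unit_concept_id)
    return {concept: len(unit_set) for concept, unit_set in units.items()}


def too_many_units(rows, min_units=5, min_measurements=10):
    """True if at least min_measurements concepts have min_units or more units.

    The defaults mirror the hard-coded thresholds quoted in the text.
    """
    counts = distinct_unit_counts(rows)
    flagged = [concept for concept, n in counts.items() if n >= min_units]
    return len(flagged) >= min_measurements
```

Exposing both thresholds as parameters makes it easy to re-run the same check with different cut-offs once the variability across sites is known.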

Another example is a rule that looks at the percentage of measurements that have NULL recorded as their numeric value (rule_id 28). This value can often differ radically between claims datasets and EHR datasets. An empirical study of variability is needed to decide an appropriate threshold.
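
As a rough sketch of the kind of comparison this implies, the percentage of NULL numeric values can be computed per site and then summarized by dataset type. The function names, the input shapes, and the choice of minimum, median, and maximum as the summary are assumptions for illustration only; the study does not prescribe this particular summary.

```python
from statistics import median


def percent_null(value_as_number):
    """Percentage of measurement rows whose numeric value is NULL (None)."""
    values = list(value_as_number)
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def summarize_by_type(site_percentages):
    """Summarize per-site percentages as (min, median, max) per dataset type.

    site_percentages: iterable of (dataset_type, percent) pairs, where
    dataset_type is a free-text label such as "EHR" or "claims".
    """
    groups = {}
    for dataset_type, percent in site_percentages:
        groups.setdefault(dataset_type, []).append(percent)
    return {t: (min(p), median(p), max(p)) for t, p in groups.items()}
```

A summary of this kind would show whether a single threshold can serve both dataset types or whether separate thresholds are needed.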

Data density measures

Achilles currently computes several useful data density measures that count the number of distinct concept_ids per person (e.g., analysis_ids 203 (visits), 403 (conditions), or 603 (procedures)). The study is piloting direct use of such data density measures for assessing data quality. It also hopes to explore the most appropriate new data quality rules that analyze data density.
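
To make the notion of data density concrete, the sketch below counts distinct concept_ids per person and reports the median. It operates on patient-level pairs purely for illustration; in the study itself only the aggregated Achilles output is shared. The function names and input shape are assumptions.

```python
from collections import defaultdict
from statistics import median


def distinct_concepts_per_person(rows):
    """Map each person_id to the number of distinct concept_ids recorded.

    rows: iterable of (person_id, concept_id) pairs (hypothetical input).
    """
    persons = defaultdict(set)
    for person_id, concept_id in rows:
        persons[person_id].add(concept_id)
    return {person: len(concepts) for person, concepts in persons.items()}


def median_density(rows):
    """Median number of distinct concepts per person (0 if there is no data)."""
    counts = distinct_concepts_per_person(rows)
    return median(counts.values()) if counts else 0
```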

Input data

The data compared are aggregated counts (not patient-level data). We invite collaborators from the OHDSI community to share aggregated data from the Achilles tool (only a selected subset of Achilles analyses, not all Achilles analyses). Some analyses represent row counts (for the whole dataset) or percentages of total row counts; those are not patient-level data. For Achilles analyses that do represent counts of patients (e.g., analysis 109), cells with fewer than 10 patients are suppressed during Achilles generation (by specifying the parameter smallcellcount=10). Moreover, many of the patient counts are further masked by being expressed as a percentage of the overall patient count. This process ensures that the data is de-identified by aggregation.
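
The two masking steps described above (small-cell suppression, then conversion to a percentage of the overall patient count) can be sketched as follows. This is an illustration, not the Achilles code: the function names are hypothetical, and the boundary (dropping cells with fewer than 10 patients) follows the wording of this protocol rather than a check against the Achilles source.

```python
def suppress_small_cells(counts, small_cell_count=10):
    """Drop cells whose patient count is below small_cell_count.

    counts: dict mapping a cell label (e.g., a stratum) to a patient count.
    """
    return {cell: n for cell, n in counts.items() if n >= small_cell_count}


def as_percent_of_total(counts, total_persons):
    """Express each remaining patient count as a percentage of all patients."""
    if total_persons <= 0:
        raise ValueError("total_persons must be positive")
    return {cell: 100.0 * n / total_persons for cell, n in counts.items()}
```

Applying the suppression first means that a small count can never be recovered from a shared percentage and the total.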

Data protection

Each site can inspect the CSV (and graph) files generated by the study prior to submitting them for comparative analysis by the study team (comparison of sites to each other). Only aggregated data is submitted for any analysis.

An Amazon S3 cloud mechanism (study specific bucket) is used to ensure that a site can contribute data, but not read data of other sites. Only the study analysis team has access to data from all sites for the purposes of the analysis.

Use of output data

We plan to compare several sites in terms of data quality outputs generated by Achilles and Achilles Heel. The final manuscript about the study results will limit as much as possible what is revealed about each site. Each site will have a chance to review (and edit) the manuscript prior to submission to the journal.

If you share your site's data with the Data Quality study principal investigator or the study team, it will be used only for the purpose of the study and comparison within the study. All compared sites will be referred to by a meaningless site ID. All results will be pooled together so that any site or dataset will be hidden in a crowd of several sites/datasets.

This principle was used in the prior Achilles Heel evaluation study (the precursor to this study). If your site requires a formal Data Use Agreement between your site and the Data Quality Study Principal Investigator, please indicate so.

Methods (for goal 4)

To discover EHR data that differs significantly from expected reference data, we have implemented a separate methodology, in the OHDSITrends package, to classify temporal trends for events (e.g., use of bevacizumab (drug_exposure) or vasectomy (procedure_occurrence)).

Data protection (for goal 4)

Trend detection uses precomputed data from Achilles as input (it does not work with patient-level data). The readme file in the export folder (available at https://github.com/OHDSI/StudyProtocolSandbox/blob/master/OHDSITrends/inst/export.txt) describes that only limited data is extracted. For the majority of events, only a classification of the event is extracted.

Use of output data (for goal 4)

To check whether we observe the same trend for a given event, we compare the classification across sites for a subset of events (e.g., for the venepuncture procedure: site A, rising vs. site B, flat). Only the classification category is exported (for the top 50% of events by event type).
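
A minimal sketch of such a cross-site comparison is shown below. It assumes each site's export can be read into a mapping from event name to classification category; that shape and the function name are assumptions for illustration and do not describe the OHDSITrends export format.

```python
def disagreeing_events(site_a, site_b):
    """Return events classified differently by two sites.

    site_a, site_b: dicts mapping an event name to its trend category
    (e.g., "rising", "flat", "declining"). Only events present at both
    sites are compared.
    """
    shared = sorted(site_a.keys() & site_b.keys())
    return {
        event: (site_a[event], site_b[event])
        for event in shared
        if site_a[event] != site_b[event]
    }
```

For the venepuncture example in the text, the result would contain one entry pairing "rising" (site A) with "flat" (site B).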

For the top 50 rising and declining event trends (within a category), we extract more detailed trend data. A site may define a blacklist of events that should not be exported in any extract (e.g., data on rising wrong-side surgery).
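
The selection step can be sketched as a filter followed by a ranking. The protocol does not state which quantity is used to rank trends, so the numeric `slope` per event used here is a hypothetical stand-in, as are the function name and input shape.

```python
def select_for_detailed_export(trends, blacklist, top_n=50):
    """Pick the top_n rising and top_n declining events, skipping blacklisted ones.

    trends: dict mapping an event name to a numeric slope (hypothetical
    ranking measure; positive means rising, negative means declining).
    blacklist: collection of event names that must never be exported.
    """
    allowed = {e: s for e, s in trends.items() if e not in blacklist}
    rising = sorted(
        (e for e, s in allowed.items() if s > 0),
        key=lambda e: allowed[e],
        reverse=True,
    )[:top_n]
    declining = sorted(
        (e for e, s in allowed.items() if s < 0),
        key=lambda e: allowed[e],
    )[:top_n]
    return rising, declining
```

Filtering against the blacklist before ranking ensures that a blacklisted event can never appear in the detailed extract, and that its slot goes to the next eligible event.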



vojtechhuser/DataQuality documentation built on May 10, 2020, 8:31 a.m.