The goal of the deidentifiedDB R package is to simplify data transformation of de-identified SARS-CoV-2 Surveillance data at Clemson University for integration into an internal-use SQLite database, which is also called deidentifiedDB.
Since this package is geared towards internal use by Clemson University personnel, functions in this package are designed to throw errors (rather than making assumptions) if there is a change in the input data format.
You can install deidentifiedDB from GitHub with:
# install.packages("devtools")
devtools::install_github("CUGBF/deidentifiedDB")
The above-mentioned installation command will install the following dependencies, if not already installed:
DBI (>= 1.1.0),
dplyr (>= 1.0.0),
lubridate (>= 1.8.0),
RSQLite (>= 2.2.9),
readr (>= 2.0.0),
stringr (>= 1.4.0),
tibble (>= 3.0.0),
tidyr (>= 1.2.0),
zoo (>= 1.8-9),
magrittr (>= 2.0),
tidyselect (>= 1.1.1),
rlang,
Biostrings,
stats,
progress (>= 1.2.2)
viralrecon
nf-core/viralrecon
is used to analyze read data generated by
SARS-CoV-2 sequencing. The results from the pipeline are stored in the
viralrecon
table of the SQLite database.
multiqc
directory). The following columns must be in the csv
file (concordant with viralrecon v2.4.1
): Sample,
# Input reads,
# Trimmed reads (fastp),
% Non-host reads (Kraken 2),
% Mapped reads,
# Mapped reads,
# Trimmed reads (iVar),
Coverage median,
% Coverage > 1x,
% Coverage > 10x,
# SNPs,
# INDELs,
# Missense variants,
# Ns per 100kb consensus,
Pangolin lineage,
Nextclade clade
Tibble with the following structure:
testkit_id,
primer_set,
primer_set_version,
sequencing_platform,
num_input_reads,
num_trimmed_reads_fastp,
pc_non_host_read,
pc_mapped_reads,
num_mapped_reads,
num_trimmed_reads_ivar,
median_coverage,
pc_coverage_gt1x,
pc_coverage_gt10x,
num_snps,
num_indels,
num_missense_var,
Ns_per_100kb,
lineage,
clade,
variant_caller,
viralrecon_version,
run_date
Input → deidentifiedDB::compile_viralrecon()
→ Output
diagnostics
Data from COVID-19 diagnostics testing is housed in the diagnostics
table of the SQLite database.
TestKitId,
Sample_ID,
Date,
PlateId,
P1_A,
P1_B,
N1_A,
N1_B,
Int_P1_A,
Int_P1_B,
Int_N1_A,
Int_N1_B,
P1_Code,
N1_Code,
Rymedi_Result,
Plate_Result,
Run_Number,
Prior_Code,
Sample_Notes
Tibble with the following structure:
testkit_id,
hashed_id,
run_date,
plate,
result,
ct_rnasep_rep1,
ct_rnasep_rep2,
ct_N_rep1,
ct_N_rep2,
control
Input → deidentifiedDB::compile_diagnostics_data()
→ Output
demographics
and sample_collection
Testing Group Name,
Patient City,
Patient Zip Code,
Patient State,
Year of Birth,
Patient Gender,
Pregnant,
Patient Ethnic Group,
Patient Race,
Patient ID,
TestKit ID,
Result description,
Result Date,
Collection Date,
Collection Time,
SKU,
Order Priority,
Performing Facility,
Tested by
csv file containing USPS zip codes and associated location information. Downloadable for academic use from unitedstateszipcodes.org
csv file containing list of all global regions and countries downloadable from UNECE
demographics
Tibble with the following structure:
patient_id
birth_year
ethnicity
race_white
race_asian
race_black_or_african_american
race_american_indian_or_alaskan_native
race_native_hawaiian_or_pacific_islander
deidentifiedDB::prepare_demographics_sc()
→
deidentifiedDB::pull_demographics()
→ Outputpatient_id
s with discrepant information (patient_id
is the primery key for the demographics
table), extract (and
remove) the rows for such patient_id
s from the output, then run
deidentifiedDB::assign_mode()
on the extracted tibble. Outputassign_mode()
to the original tibble
containing demographics information for all other patient_id
ssample_collection
Tibble with the following structure:
testkit_id,
rymedi_result,
population,
order_priority,
collection_date,
result_date,
gender,
pregnancy_status,
zip_code,
city,
county,
state,
country,
zip_code_user_input,
city_user_input,
state_user_input,
patient_id,
teskit_sku,
performing_facility,
testing_facility
deidentifiedDB::prepare_demographics_sc()
→
deidentifiedDB::pull_sc()
deidentifiedDB::get_us_entities()
and manually check the output
from step 1 if any US state name was used by a user instead of the
code. Make manual corrections if needed.deidentifiedDB::compile_sc_data()
→ Final
Outputbiorepository
The biorepository
table in the SQLite database contains information
regarding the storage position in -80C for a subset of COVID-19 positive
samples.
TestKit ID
Box IDN,
Box Position 1,
Vial IDN 1,
Box Position 2,
Vial IDN 2,
Box Position 3,
Vial IDN 3
Tibble with the following structure:
testkit_id,
box_idn,
box_position_1,
vial_idn_1,
box_position_2,
vial_idn_2,
box_position_3,
vial_idn_3
deidentifiedDB::compile_biorepo()
→ Outputgenbank
All sequenced samples that were submitted to GenBank are recorded in the
the genbank
table of deidentifiedDB database
Accession
Sequence ID
Release Date
Output_list[['int_tbl']]
returned by
deidentifiedDB::compile_genbank()
Tibble with the following structure:
testkit_id
sequence_ID
genbank_accession
pipeline
submission_date
release_date
deidentifiedDB::compile_genbank_table()
→ Outputtestkit_id
s to be submittedtestkit_id
(found in
/variants/ivar/consensus/bcftools/)sample_collection
, demographics
and viralrecon
tables
discussed above.R list with the following elements:
int_tbl
- Tibble for internal recordsext_tbl
- Tibble containing metadata in the format required by
GenBankseqs
- DNAStringSet containing the consensus sequences.deidentifiedDB::compile_genbank()
→ Output
The DNAStringSet can be written to a multi-sequence FASTA file by using the following command:
writeXStringSet(Output[['seqs']],
'genbank_submission.fasta')
get_pangolin_distribution()
sample_collection
and viralrecon
tables discussed above.Monthly count of sequenced samples belonging to each Pangolin lineage.
Tibble with the following columns:
collection_month
lineage
n_sequenced_samples
deidentifiedDB::get_pangolin_distribution()
→ Outputget_nextclade_distribution()
sample_collection
and viralrecon
tables discussed above.Monthly count of sequenced samples belonging to each Nextclade
Tibble with the following columns:
collection_month
clade
n_sequenced_samples
deidentifiedDB::get_nextclade_distribution()
→ Outputget_positivity()
Computes Weekly Test Positivity Rate (TPR)
sample_collection
and viralrecon
tables discussed above.Tibble with the following columns:
collection_week
week_start
week_end
order_priority
TOTAL
POSITIVE
NEGATIVE
POSITIVITY
deidentifiedDB::get_positivity()
→ Outputget_daily_diagnostics()
diagnostics
table discussed above.Tibble with the following columns:
<grouping_variables>
count
deidentifiedDB::get_daily_diagnostics()
→ OutputMake sure are specified in the function call.
For example:
deidentifiedDB::get_daily_diagnostics(diagnostics_tbl,
start_date,
end_date,
*run_date*,
*result*)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.