clean_npx: Clean proteomics data quantified with Olink's PEA technology

clean_npx    R Documentation

Clean proteomics data quantified with Olink's PEA technology

Description

This function applies a series of cleaning steps to a data set exported by Olink Software and imported into R by read_npx(). Some of the steps rely on results from check_npx().

This function removes samples and assays that are not suitable for downstream statistical analysis. The removed data records include samples with duplicate identifiers, external control samples, internal control assays, and samples or assays with quality control flags.

Usage

clean_npx(
  df,
  check_log = NULL,
  remove_assay_na = TRUE,
  remove_invalid_oid = TRUE,
  remove_dup_sample_id = TRUE,
  remove_control_assay = TRUE,
  remove_control_sample = TRUE,
  remove_qc_warning = TRUE,
  remove_assay_warning = TRUE,
  control_sample_ids = NULL,
  convert_df_cols = TRUE,
  convert_nonunique_uniprot = TRUE,
  out_df = "tibble",
  verbose = FALSE
)

Arguments

df

A "tibble" or "ArrowObject" from read_npx().

check_log

A named list returned by check_npx(). If NULL, check_npx() will be run internally using df.

remove_assay_na

Logical. If FALSE, skips filtering assays with all quantified values NA. Defaults to TRUE.

remove_invalid_oid

Logical. If FALSE, skips filtering assays with invalid identifiers. Defaults to TRUE.

remove_dup_sample_id

Logical. If FALSE, skips filtering samples with duplicate sample identifiers. Defaults to TRUE.

remove_control_assay

If FALSE, all internal control assays are retained. If TRUE, all internal control assays are removed. Alternatively, a character vector with one or more of "assay", "inc", "det", "ext", and "amp" indicating the assay types to remove.

remove_control_sample

If FALSE, all control samples are retained. If TRUE, all control samples are removed. Alternatively, a character vector with one or more of "sample", "sc", "pc", "nc", "calibrator", and "other" indicating the sample types to remove.

remove_qc_warning

Logical. If FALSE, retains samples flagged as FAIL in QC warning. Defaults to TRUE.

remove_assay_warning

Logical. If FALSE, retains assays flagged as WARN in assay warning. Defaults to TRUE.

control_sample_ids

Character vector of sample identifiers of control samples to remove. Defaults to NULL, in which case no samples are removed by identifier.

convert_df_cols

Logical. If FALSE, retains the columns of df as they are. Defaults to TRUE, where columns required for downstream analysis are converted to the expected format.

convert_nonunique_uniprot

Logical. If FALSE, retains non-unique OlinkID - UniProt mapping. Defaults to TRUE.

out_df

The class of the output dataset. One of "tibble" or "arrow". Defaults to "tibble".

verbose

Logical. If FALSE (default), silences step-wise messages.

Details

The pipeline performs the following steps:

  1. Remove assays with invalid identifiers: assays flagged by check_npx() as having invalid identifiers. This occurs when the original data set provided by Olink Software has been modified.

  2. Remove assays with NA quantification values: assays lacking quantification data are reported with NA as quantification. These assays are identified in check_npx().

  3. Remove samples with duplicate identifiers: samples with identical identifiers detected by check_npx(). Duplicate sample identifiers cause errors in the downstream analysis of the data and are highly discouraged.

  4. Remove external control samples:

    • Uses the column marking sample type (e.g. SampleType) to exclude external control samples.

    • Uses the column marking sample identifier (e.g. SampleID) to remove external control samples, or samples that one wants to exclude from the downstream analysis.

  5. Remove samples failing quality control: samples with QC status FAIL.

  6. Remove internal control assays: uses the column marking assay type (e.g. AssayType) to exclude internal control assays.

  7. Remove assays with quality control warnings: assays with QC status WARN.

  8. Correct column data type: ensure that certain columns have the expected data type (class). These columns are identified in check_npx().

  9. Resolve multiple UniProt mappings per assay: ensure that each assay identifier (e.g., OlinkID) maps uniquely to a single UniProt ID.

Important:

  • When the data set lacks a column marking sample type (e.g. SampleType), one should remove external control samples based on their sample identifiers. This function does not auto-detect external control samples based on their sample identifiers. Please ensure external control samples have been removed prior to downstream statistical analysis.

  • When the data set lacks a column marking assay type (e.g. AssayType), one should remove internal control assays manually. This function does not auto-detect internal control assays. Please ensure internal control assays have been removed prior to downstream statistical analysis.
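When neither a sample type nor an assay type column is available, the manual removal described above could look roughly like the following base R sketch. The sample identifiers, assay names and the regular expressions are hypothetical; the naming scheme of control samples and control assays in a real data set should be checked before relying on patterns like these.

```r
# Toy data without SampleType or AssayType columns (hypothetical values).
npx <- data.frame(
  SampleID = rep(c("A1", "A2", "CONTROL_SAMPLE_1", "NEG_CTRL_1"), each = 2),
  Assay    = rep(c("IL6", "Extension control"), times = 4),
  NPX      = c(1.1, 0.2, 0.9, 0.1, 1.0, 0.2, 0.1, 0.0),
  stringsAsFactors = FALSE
)

# Collect identifiers of external control samples by pattern (assumed naming).
# A vector like this could be supplied to clean_npx() via control_sample_ids.
control_ids <- unique(grep("CONTROL|CTRL", npx$SampleID, value = TRUE))

# Flag internal control assays by name (assumed to end in "control").
is_control_assay <- grepl("control$", npx$Assay, ignore.case = TRUE)

# Keep only study samples and non-control assays.
npx_manual <- npx[!npx$SampleID %in% control_ids & !is_control_assay, ]

print(control_ids)
print(npx_manual)
```

Passing the identifiers through control_sample_ids keeps the removal inside clean_npx(), whereas internal control assays without an assay type column have to be filtered out separately, as in the last step of the sketch.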

Value

A data set, "tibble" or "ArrowObject" depending on out_df, with the cleaned Olink data in long format.

Author(s)

Kang Dong, Klev Diamanti

Examples

## Not run: 
# run check_npx
check_log <- check_npx(
  df = npx_data1
)

# run clean_npx
clean_npx(
  df = npx_data1,
  check_log = check_log
)

# run clean_npx with messages for all steps
clean_npx(
  df = npx_data1,
  check_log = check_log,
  verbose = TRUE
)

## End(Not run)


OlinkAnalyze documentation built on June 24, 2026, 1:06 a.m.