library(SomaDataIO)
library(withr)
Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>"
)

Overview

Occasionally, additional clinical data is obtained after samples have been submitted to SomaLogic, Inc. or even after 'SomaScan' results have been delivered.

This requires the new clinical, i.e. non-proteomic, data to be merged with the 'SomaScan' data into a "new" ADAT prior to analysis. For this purpose, a command-line-interface ("CLI") tool has been included with SomaDataIO in the cli/merge/ directory, which allows one to generate an updated *.adat file via the command-line without having to launch an integrated development environment ("IDE"), e.g. RStudio.

To use SomaDataIO's exported functionality from within an R session, please see merge_clin().
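For readers who prefer to stay inside R, the in-session route might look roughly like the sketch below. It assumes merge_clin() accepts a soma_adat, a data frame of clinical variables, and a `by` key name; consult ?merge_clin for the authoritative signature. The `clin` data frame is made up for illustration.

```r
library(SomaDataIO)

# Hypothetical clinical data keyed on SampleId (illustrative values only)
clin <- data.frame(
  SampleId = c("1", "3"),          # character, to match the ADAT key type
  group    = c("a", "b")
)

# A small slice of the example ADAT shipped with SomaDataIO
adat <- head(example_data, 4L)

# Assumed usage: merge the clinical columns onto the ADAT by the common key
merged <- merge_clin(adat, clin, by = "SampleId")

# The new clinical column should now appear among the column names
"group" %in% names(merged)
```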


Setup

The clinical merge tool is an R script that comes with an installation of SomaDataIO:

dir(system.file("cli", "merge", package = "SomaDataIO", mustWork = TRUE))

merge_script <- system.file("cli/merge", "merge_clin.R", package = "SomaDataIO")
merge_script

First create a temporary "analysis" directory:

analysis_dir <- tempfile(pattern = "somascan-")
# create directory
dir.create(analysis_dir)

# sanity check
dir.exists(analysis_dir)

# copy merge tool into analysis directory
file.copy(merge_script, to = analysis_dir)

Create Example Data

Let's create some dummy 'SomaScan' data derived from the example_data object from SomaDataIO. First we reduce its size to 9 samples and 5 proteomic features, and then write it to a text file in our new analysis directory with write_adat(). This will be the "new" starting point for the clinical data merge and represents where customers would typically begin an analysis.

feats <- withr::with_seed(3, sample(getAnalytes(example_data), 5L))
sub_adat <- dplyr::select(example_data, PlateId, SlideId, Subarray,
                          SampleId, Age, all_of(feats)) |> head(9L)
withr::with_dir(analysis_dir,
  write_adat(sub_adat, file = "ex-data-9.adat")
)

Next we create random clinical data with a common key (this is typically the SampleId identifier but it could be any common key).

df <- data.frame(SampleId = as.character(seq(1, 9, by = 2)),  # common key
                 group    = c("a", "b", "a", "b", "a"),
                 newvar   = withr::with_seed(1, rnorm(5)))
df

# write clinical data to file
withr::with_dir(analysis_dir,
  write.csv(df, file = "clin-data.csv", row.names = FALSE)
)
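Note that the clinical data above covers only 5 of the 9 samples. The base R sketch below illustrates what a key-based merge does in that situation, assuming the merge keeps every ADAT row (a left join). The data frames `adat_like` and `clin_like` are toy stand-ins, not real 'SomaScan' data, and base merge() is used only to illustrate the idea, not to describe how the merge tool is implemented.

```r
# Toy stand-ins (hypothetical values)
adat_like <- data.frame(SampleId = as.character(1:9), Age = 30:38)
clin_like <- data.frame(SampleId = as.character(seq(1, 9, by = 2)),
                        group    = c("a", "b", "a", "b", "a"))

# Left join: keep all rows of `adat_like`, add clinical columns where keys match
merged <- merge(adat_like, clin_like, by = "SampleId", all.x = TRUE)

nrow(merged)               # every original sample is retained
sum(is.na(merged$group))   # samples without clinical data get NA
```

The key columns should have the same type in both tables, which is why SampleId is created with as.character() in the clinical data.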

At this point there are now 3 files in our analysis directory:

dir(analysis_dir)
  1. merge_clin.R: the merge script engine itself
  2. clin-data.csv: new clinical data containing 3 columns:
    • a common key: SampleId
    • a new grouping variable: group
    • a new continuous variable: newvar
  3. ex-data-9.adat: ADAT with 9 samples containing 5 'SomaScan' proteomic features and 5 pre-existing variables, into which the new clinical data will be merged:
    • PlateId, SlideId, Subarray, SampleId, and Age
    • note: PlateId, SlideId, and Subarray are key fields common to almost all ADATs; removing them could yield unintended results
    • the common key SampleId is required

Merging Clinical Data

The clinical data merge tool is simple to use from most common command-line terminals (Bash, Zsh, etc.). R must be installed and available from the terminal, along with SomaDataIO and its dependencies.

Arguments

The merge script takes four ordered arguments:

  1. path to the original ADAT (*.adat) file
  2. path to clinical data (*.csv) file
  3. common key variable name (e.g. SampleId)
  4. output file name (*.adat) for new ADAT
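Because the arguments are positional, their order matters. The sketch below shows one way an R script could read and validate four positional arguments with commandArgs(). The helper `parse_merge_args()` is hypothetical and is not the code inside merge_clin.R.

```r
# Hypothetical helper: NOT the actual implementation in merge_clin.R
parse_merge_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 4L) {
    stop("Expected 4 arguments: <adat> <csv> <key> <output>", call. = FALSE)
  }
  # Name the positional arguments in the documented order
  setNames(as.list(args), c("adat", "csv", "key", "out"))
}

parsed <- parse_merge_args(
  c("ex-data-9.adat", "clin-data.csv", "SampleId", "ex-data-9-merged.adat")
)
parsed$key
```

Failing early on a wrong argument count gives a clearer message than an error raised later while reading a mis-specified file.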

Standard Syntax

The primary syntax applies when the common key has the same variable name in both files (ADAT and CSV):

# change directory to the analysis path
# (replace with the value of `analysis_dir` from the R session)
cd /path/to/analysis_dir

# run the Rscript:
# - we recommend using the --vanilla flag
Rscript --vanilla merge_clin.R ex-data-9.adat clin-data.csv SampleId ex-data-9-merged.adat
withr::with_dir(analysis_dir,
  base::system2(
    "Rscript",
    c("--vanilla",
      "merge_clin.R",
      "ex-data-9.adat",
      "clin-data.csv",
      "SampleId",
      "ex-data-9-merged.adat")
  )
)
dir(analysis_dir)
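When driving a script from R via system2(), the return value is the command's exit status, which can be checked before reading any output file. The sketch below demonstrates this with trivial Rscript one-liners rather than the merge script itself; locating Rscript through R.home("bin") is a choice made for this example.

```r
# Locate the Rscript binary that belongs to the running R installation
rscript <- file.path(R.home("bin"), "Rscript")

# A command that exits cleanly, and one that exits with a non-zero status
ok  <- system2(rscript, c("--vanilla", "-e", shQuote("quit(status = 0)")))
bad <- system2(rscript, c("--vanilla", "-e", shQuote("quit(status = 2)")))

# A non-zero status signals failure, so a wrapper can stop early
if (ok != 0L) stop("command failed with status ", ok)
bad != 0L
```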

Alternative Syntax

In certain instances the common key may have a different variable name in each file. This is handled by a modification to argument 3, which now takes the form key1=key2, where key1 is the name of the key variable in the *.adat file and key2 is the name of the key variable in the *.csv file.

To highlight this syntax, first let's create a new clinical data file with a different variable name, ClinID:

# rename original `df`
names(df) <- c("ClinID", "letter", "size")
df

# write clinical data to file
withr::with_dir(analysis_dir,
  write.csv(df, file = "clin-data2.csv", row.names = FALSE)
)

We can now execute the same merge script at the command line with a slightly modified syntax:

Rscript --vanilla merge_clin.R ex-data-9.adat clin-data2.csv SampleId=ClinID ex-data-9-merged2.adat
withr::with_dir(analysis_dir,
  base::system2(
    "Rscript",
    c("--vanilla",
      "merge_clin.R",
      "ex-data-9.adat",
      "clin-data2.csv",
      "SampleId=ClinID",
      "ex-data-9-merged2.adat")
  )
)
dir(analysis_dir)
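To make the key1=key2 form concrete, the base R sketch below splits such a string and uses the two names in a merge. The helper `split_key()` and the toy data frames are hypothetical, and base merge() stands in for whatever the merge tool does internally.

```r
# Hypothetical helper: split "key1=key2" into ADAT and CSV key names;
# a plain "key" (no "=") is treated as the same name on both sides
split_key <- function(key) {
  parts <- strsplit(key, "=", fixed = TRUE)[[1L]]
  if (length(parts) == 1L) parts <- rep(parts, 2L)
  stopifnot(length(parts) == 2L)
  c(adat = parts[1L], clin = parts[2L])
}

keys <- split_key("SampleId=ClinID")

# Toy stand-ins (hypothetical values)
adat_like <- data.frame(SampleId = as.character(1:3), Age = c(40, 50, 60))
clin_like <- data.frame(ClinID = c("1", "3"), letter = c("a", "b"))

# Join on differently named keys; the result keeps the ADAT-side key name
merged <- merge(adat_like, clin_like,
                by.x = keys[["adat"]], by.y = keys[["clin"]], all.x = TRUE)
names(merged)
```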

Check Results

Now let's check that the clinical data was merged successfully and yields the expected *.adat:

new <- withr::with_dir(analysis_dir,
  read_adat("ex-data-9-merged2.adat")
)
new

getMeta(new)

getAnalytes(new)

Summary

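# clean up: remove the temporary analysis directory and its contents
# (`recursive = TRUE` is required for unlink() to delete a directory)
if ( dir.exists(analysis_dir) ) {
  unlink(analysis_dir, recursive = TRUE, force = TRUE)
}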


SomaLogic/SomaDataIO documentation built on Feb. 8, 2025, 12:19 p.m.