Pediatric cancers

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(dplyr)
library(ggplot2)
library(recipes)
library(scimo)

theme_set(theme_light())

data("pedcan_expression")

Dataset

pedcan_expression contains gene expression data for 108 cell lines from 5 different pediatric cancers. For each cell line, it also records the sex of the original donor, the type of cancer the line represents, and whether it was derived from a primary tumor or a metastasis.

pedcan_expression
count(pedcan_expression, disease, sort = TRUE)

Dimension reduction

One way to explore this dataset is to perform a principal component analysis (PCA).

rec_naive_pca <-
  recipe(pedcan_expression) %>% 
  update_role(-cell_line) %>% 
  step_zv(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors()) %>% 
  prep()

rec_naive_pca %>% 
  bake(new_data = NULL) %>% 
  ggplot() +
  aes(x = PC1, y = PC2, color = disease) +
  geom_point()
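
For readers less familiar with recipes, the three preprocessing steps above can be approximated in base R. This is only a rough sketch on a made-up toy matrix (the values and the `gene1` to `gene4` names are hypothetical), not the code that recipes runs internally: it drops constant features, standardizes the rest, and computes principal components with `prcomp()`.

```r
# Toy expression matrix: 6 samples x 4 features (hypothetical values).
set.seed(1)
toy <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(NULL, paste0("gene", 1:4))
)
toy[, "gene4"] <- 1  # a constant feature, similar to what step_zv() removes

# Drop zero-variance columns.
keep <- apply(toy, 2, var) > 0
toy_nzv <- toy[, keep, drop = FALSE]

# Center and scale each feature, then compute the principal components.
pca <- prcomp(toy_nzv, center = TRUE, scale. = TRUE)

# Sample coordinates on the first two components.
head(pca$x[, 1:2])
```

The columns of `pca$x` play the same role as the `PC1`, `PC2`, ... columns returned by `bake()` in the recipe above.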

To obtain a clearer PCA plot, one can precede PCA with a feature selection step based on the coefficient of variation. Here, step_select_cv keeps only a quarter of the original features.

rec_cv_pca <-
  recipe(pedcan_expression) %>% 
  update_role(-cell_line) %>% 
  step_select_cv(all_numeric_predictors(), prop_kept = 1/4) %>% 
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors()) %>%
  prep()

rec_cv_pca %>% 
  bake(new_data = NULL) %>% 
  ggplot() +
  aes(x = PC1, y = PC2, color = disease) +
  geom_point()
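
To make the selection criterion concrete, here is a minimal base R sketch of ranking features by their coefficient of variation and keeping the top quarter. It assumes the usual definition (standard deviation divided by the absolute mean). The `cv()` helper, the toy data, and the rounding rule are illustrative assumptions, so scimo's own implementation may differ in its details.

```r
# Hypothetical helper: coefficient of variation of a numeric vector.
cv <- function(x) sd(x) / abs(mean(x))

# Toy data: 4 samples x 4 features (made-up values).
toy <- data.frame(
  gene_a = c(10, 10.5, 9.5, 10),  # nearly flat
  gene_b = c(1, 8, 2, 9),         # variable
  gene_c = c(5, 6, 5, 6),         # slightly variable
  gene_d = c(2, 20, 4, 30)        # most variable relative to its mean
)

# One coefficient of variation per feature.
cvs <- vapply(toy, cv, numeric(1))

# Keep the top quarter of features by CV (at least one feature).
# Rounding down is an assumption made for this sketch.
n_kept <- max(1, floor(ncol(toy) * 1 / 4))
kept <- names(sort(cvs, decreasing = TRUE))[seq_len(n_kept)]
kept
```

Features with a low coefficient of variation change little across samples relative to their mean, so dropping them removes columns that contribute mostly noise once every feature is rescaled to unit variance.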

The tidy method shows which features are kept.

tidy(rec_cv_pca, 1)



scimo documentation built on June 24, 2024, 5:17 p.m.