nanny: high-level data analysis and manipulation in tidyverse style"

Lifecycle:maturing

Functions/utilities available

Function | Description ------------ | ------------- reduce_dimensions | Perform dimensionality reduction (PCA, MDS, tSNE) rotate_dimensions | Rotate two dimensions of a degree cluster_elements | Labels elements with cluster identity remove_redundancy | Filter out elements with highly correlated features fill_missing | Fill values of missing element/feature pairs impute_missing | Impute values of missing element/feature pairs permute_nest | From one column build a two permuted columns with nested information combine_nest | From one column build a two combination columns with nested information keep_variable | Keep top variable features lower_triangular | keep rows corresponding to a lower triangular matrix

Utilities | Description ------------ | ------------- as_matrix | Robustly convert a tibble to matrix subset| Select columns with information relative to a column of interest

Minimal input data frame

element | feature | value ------------ | ------------- | ------------- chr or fctr | chr or fctr | numeric

Output data frame

element | feature | value | new information ------------ | ------------- | ------------- | ------------- chr or fctr | chr or fctr | numeric | ...

library(knitr)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
                      message = FALSE, cache.lazy = FALSE)

library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(magrittr)
library(nanny)

my_theme =  
    theme_bw() +
    theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size=12),
        legend.position="bottom",
        aspect.ratio=1,
        strip.background = element_blank(),
        axis.title.x  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
        axis.title.y  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
    )

Installation

devtools::install_github("stemangiola/nanny")

Introduction

nanny is a collection of wrapper functions for high level data analysis and manipulation following the tidy paradigm.

Tidy data

mtcars_tidy = 
    mtcars %>% 
    as_tibble(rownames="car_model") %>% 
    mutate_at(vars(-car_model,- hp, -vs), scale) %>%
    gather(feature, value, -car_model, -hp, -vs)

mtcars_tidy

reduce_dimensions

We may want to reduce the dimensions of our data, for example using PCA, MDS of tSNE algorithms. reduce_dimensions takes a tibble, column names (as symbols; for element, feature and value) and a method (e.g., MDS, PCA or tSNE) as arguments and returns a tibble with additional columns for the reduced dimensions.

MDS

mtcars_tidy_MDS =
  mtcars_tidy %>%
  reduce_dimensions(car_model, feature, value, method="MDS", .dims = 3)

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

mtcars_tidy_MDS %>% subset(car_model)  %>% select(contains("Dim"), everything())

mtcars_tidy_MDS %>%
    subset(car_model) %>%
  GGally::ggpairs(columns = 4:6, ggplot2::aes(colour=factor(vs)))

PCA

mtcars_tidy_PCA =
  mtcars_tidy %>%
  reduce_dimensions(car_model, feature, value, method="PCA", .dims = 3)

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

mtcars_tidy_PCA %>% subset(car_model) %>% select(contains("PC"), everything())

mtcars_tidy_PCA %>%
     subset(car_model) %>%
  GGally::ggpairs(columns = 4:6, ggplot2::aes(colour=factor(vs)))

tSNE

mtcars_tidy_tSNE =
    mtcars_tidy %>% 
    reduce_dimensions(car_model, feature, value, method = "tSNE")

Plot

mtcars_tidy_tSNE %>%
    subset(car_model) %>%
    select(contains("tSNE"), everything()) 

mtcars_tidy_tSNE %>%
    subset(car_model) %>%
    ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=factor(vs))) + geom_point() + my_theme

rotate_dimensions

We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. rotate_dimensions takes a tibble, column names (as symbols; for element, feature and value) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as <NAME OF DIMENSION> rotated <ANGLE> by default, or as specified in the input arguments.

mtcars_tidy_MDS.rotated =
  mtcars_tidy_MDS %>%
    rotate_dimensions(`Dim1`, `Dim2`, .element = car_model, rotation_degrees = 45, action="get")

Original On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.

mtcars_tidy_MDS.rotated %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=factor(vs) )) +
  geom_point() +
  my_theme

Rotated On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.

mtcars_tidy_MDS.rotated %>%
    ggplot(aes(x=`Dim1 rotated 45`, y=`Dim2 rotated 45`, color=factor(vs) )) +
  geom_point() +
  my_theme

cluster_elements

We may want to cluster our data (e.g., using k-means element-wise). cluster_elements takes as arguments a tibble, column names (as symbols; for element, feature and value) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods.

k-means

mtcars_tidy_cluster = mtcars_tidy_MDS %>%
  cluster_elements(car_model, feature, value, method="kmeans",  centers = 2, action="get" )

We can add cluster annotation to the MDS dimesion reduced data set and plot.

 mtcars_tidy_cluster %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=cluster_kmeans)) +
  geom_point() +
  my_theme
mtcars_tidy_SNN =
    mtcars_tidy_tSNE %>%
    cluster_elements(car_model, feature, value, method = "SNN")
mtcars_tidy_SNN %>%
    subset(car_model) %>%
    select(contains("tSNE"), everything()) 

mtcars_tidy_SNN %>%
    subset(car_model) %>%
    ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=cluster_SNN)) + geom_point() + my_theme

drop_redundant

We may want to remove redundant elements from the original data set (e.g., elements or features), for example if we want to define cell-type specific signatures with low element redundancy. remove_redundancy takes as arguments a tibble, column names (as symbols; for element, feature and value) and returns a tibble dropped recundant elements (e.g., elements). Two redundancy estimation approaches are supported:

removal of highly correlated clusters of elements (keeping a representative) with method="correlation"

mtcars_tidy_non_redundant =
    mtcars_tidy_MDS %>%
  remove_redundancy(car_model, feature, value)

We can visualise how the reduced redundancy with the reduced dimentions look like

mtcars_tidy_non_redundant %>%
    subset(car_model) %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=factor(vs))) +
  geom_point() +
  my_theme
mtcars_tidy_non_redundant =
    mtcars_tidy_MDS %>%
  remove_redundancy(
    car_model, feature, value, 
    method = "reduced_dimensions",
    Dim_a_column = `Dim1`,
    Dim_b_column = `Dim2`
  )
mtcars_tidy_non_redundant %>%
    subset(car_model) %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=factor(vs))) +
  geom_point() +
  my_theme

fill_missing

This function allows to obtain a rectangular underlying data structure, where every element has one feature, filling missing element/feature pairs with a value of choice (e.g., 0)

We create a non-rectangular data frame

mtcars_tidy_non_rectangular = mtcars_tidy %>% slice(-1)

We fill the missing value with the value of 0

mtcars_tidy_non_rectangular %>% fill_missing(car_model, feature, value, fill_with = 0)

impute_missing

This function allows to obtain a rectangular underlying data structure, where every element has one feature, imputig missing element/feature pairs with a function of choice (e.g., median)

We impute the missing value with the a summary value (median by default) according to a grouping

mtcars_tidy_non_rectangular %>% mutate(vs = factor(vs)) %>% 
    impute_missing( car_model, feature, value,  ~ vs) %>%

    # Print imputed first
    arrange(car_model != "Mazda RX4" | feature != "mpg")

permute_nest

From one column build a two permuted columns with nested information

mtcars_tidy_permuted = 
    mtcars_tidy %>%
    permute_nest(car_model, c(feature,value))

mtcars_tidy_permuted

combine_nest

From one column build a two combination columns with nested information

mtcars_tidy %>%
    combine_nest(car_model, value)

lower_triangular

keep rows corresponding to a lower triangular matrix

mtcars_tidy_permuted %>%

    # Summarise mpg
    mutate(data = map(data, ~ .x %>% filter(feature == "mpg") %>% summarise(mean(value)))) %>%
    unnest(data) %>%

    # Lower triangular
    lower_triangular(car_model_1, car_model_2,  `mean(value)`)

keep_variable

Keep top variable features

mtcars_tidy %>%
    keep_variable(car_model, feature, value, top=10)

as_matrix

Robustly convert a tibble to matrix

mtcars_tidy %>%
    select(car_model, feature, value) %>%
    spread(feature, value) %>%
    as_matrix(rownames = car_model) %>%
    head()

subset

Select columns with information relative to a column of interest

mtcars_tidy %>%
    subset(car_model)

nest_subset

Nest a data frame based on the columns with information relative to the column provided to nest

mtcars_tidy %>% nest_subset(data = -car_model)

ADD versus GET versus ONLY modes

Every function takes a tidyfeatureomics structured data as input, and (i) with action="add" outputs the new information joint to the original input data frame (default), (ii) with action="get" the new information with the element or feature relative informatin depending on what the analysis is about, or (iii) with action="only" just the new information. For example, from this data set

  mtcars_tidy

action="add" (Default) We can add the MDS dimensions to the original data set

  mtcars_tidy %>%
    reduce_dimensions(
        car_model, feature, value, 
        method="MDS" ,
        .dims = 3,
        action="add"
    )

action="get" We can add the MDS dimensions to the original data set selecting just the element-wise column

  mtcars_tidy %>%
    reduce_dimensions(
        car_model, feature, value, 
        method="MDS" ,
        .dims = 3,
        action="get"
    )

action="only" We can get just the MDS dimensions relative to each element

  mtcars_tidy %>%
    reduce_dimensions(
        car_model, feature, value, 
        method="MDS" ,
        .dims = 3,
        action="only"
    )


Try the nanny package in your browser

Any scripts or data that you put into this service are public.

nanny documentation built on July 1, 2020, 10:20 p.m.