In russHyde/polyply: Manipulate Multiple Data-frames In A Pipeline
wzxhzdk:0
## Background and Links:
- @POG_LRC / @GUcancersci / @bloodwise_uk
- I'm a postdoc bioinformatician at The Paul O'Gorman (POG) Leukaemia
Research Centre (University of Glasgow)
- ... working for Prof. Mhairi Copland (POG) and Dr. David Vetrie
(Wolfson-Wohl Cancer Research Centre)
- ... on a Bloodwise-funded grant
- ... into chronic-myeloid leukaemia
- @haematobot
- Personal mumblings about code / analysis / bioinformatics and seemingly
very little else ...
- https://biolearnr.blogspot.com/
- Even more mumblings
## Preamble
See `https://github.com/russHyde/polyply`
wzxhzdk:1
# Data-Modelling
## Tidy Data and the Normal-Forms {.build}
In tidy data:
- TD1 - Each variable forms a column.
- TD2 - Each observation forms a row.
- TD3 - Each type of observational unit forms a table.
- [TD4 - A key permitting table-joins is present]
See also: Boyce-Codd normal form and relational-database design.
- ?? TD5 - A tidy way of encapsulating your nicely decomposed tables
- ?? TD6 - An explicit workflow for combining your tables back together
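
As a small illustration of TD3 and TD4, the sketch below splits some made-up data into two tables (one per observational unit) that share a key, then joins them back together. All table names, column names and values are hypothetical.

```r
library(dplyr)

# One table per observational unit (TD3); `sample_id` is the shared key (TD4).
# Everything here is invented for illustration.
samples <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  treatment = c("control", "drug", "drug")
)

measurements <- data.frame(
  sample_id = c("s1", "s1", "s2", "s2", "s3", "s3"),
  gene_id = c("g1", "g2", "g1", "g2", "g1", "g2"),
  expression = c(5.1, 7.3, 4.8, 9.0, 5.0, 8.7)
)

# Recombine the decomposed tables via the key when a single table is needed
combined <- inner_join(measurements, samples, by = "sample_id")
```

The treatment label is stored once per sample rather than once per measurement, which is the reduction in duplication that the normal forms aim for.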
## Common _Untidy_ Data Structures
Tidy-data / normal-forms in R:
- $\downarrow$ duplication
- play nicely with some important things (`ggplot2` etc)
But untidy data-structures are useful if they:
- $\uparrow$ access efficiency
- $\downarrow$ code complexity
- play nicely with other important things
## `Biobase::ExpressionSet`
wzxhzdk:2
wzxhzdk:3
Figure made with `DiagrammeR`
## `Biobase::ExpressionSet` (cont.)
Conversion of the `assayData` to meet tidy-data standards:
wzxhzdk:4
wzxhzdk:5
The `assayData` matrix doesn't meet tidy-data standards:
- rows correspond to features, columns to samples
- not all variables are in columns (since row-IDs are meaningful)
- all entries are the same 'type' of variable (so they belong in a single column)
----
Easy fix:
wzxhzdk:6
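
One way such a conversion might look, sketched on a small made-up matrix rather than a real `ExpressionSet`. The column names `feature_id`, `sample_id` and `value` are assumptions, and this is not necessarily the code used in the talk.

```r
library(tibble)
library(tidyr)

# A made-up features-by-samples matrix, standing in for the assay data
expr_matrix <- matrix(
  c(5.1, 7.3, 4.8, 9.0, 5.0, 8.7),
  nrow = 2,
  dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))
)

# Move the meaningful row-IDs into an ordinary column
expr_df <- rownames_to_column(as.data.frame(expr_matrix), var = "feature_id")

# Gather the sample columns into rows: one row per (feature, sample) pair
tidy_expr <- pivot_longer(
  expr_df,
  cols = -feature_id,
  names_to = "sample_id",
  values_to = "value"
)
```

The result has one row for each cell of the original matrix, so a 2 x 3 matrix becomes a 6-row, 3-column table.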
## But ...
- Matrix representation was more dense
- Lost all encapsulation
- (After modifying `featureData` / `phenoData` to match)
- Have to join rather than index
- Have to keep track of multiple data-frames, rather than one
data-structure
## That multi-data-frame _thing_
For a reasonably complex project:
- tidy-data / normal-forms mean more data-frames
Wanted:
- a lightweight approach to working with multiple 'conceptually-related'
data-frames
- that plays nicely with `tidyverse` verbs
- that feeds into `ggplot2`
- that plays nicely with untidy data-structures I use _all the time_
# `tidygraph` already (sort of) does this
## Graph theory
wzxhzdk:7
## Basics of 'graph theory' speak
A graph is made up of two sets:
- _V_, a set of vertices:
- aka nodes, actors, ...
- _E_, a set of edges:
- pairwise relationships between vertices
- aka interactions, lines, arcs, ...
- Need to store attributes for both nodes and edges
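
The two sets map naturally onto two data-frames, one for _V_ and one for _E_. A minimal sketch in base R, with made-up node names and attributes:

```r
# Vertices (V): one row per node, plus node attributes
nodes <- data.frame(
  name = c("a", "b", "c"),
  kind = c("gene", "gene", "protein")
)

# Edges (E): one row per pairwise relationship, plus edge attributes
edges <- data.frame(
  from = c("a", "a", "b"),
  to = c("b", "c", "c"),
  weight = c(0.9, 0.4, 0.7)
)

# Every edge endpoint should refer to a known vertex
all_endpoints_known <- all(c(edges$from, edges$to) %in% nodes$name)
```

The `from` / `to` columns act as keys into the node table, which is what ties the two data-frames together.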
## `tbl_graph` data structure
`tidygraph` is really a wrapper around the package `igraph`
wzxhzdk:8
## `tbl_graph` data structure (cont.)
wzxhzdk:9
## The `activate` verb
Think of the `tbl_graph` as `list[nodes, edges]`
To modify the contents of a given data-frame, `activate` it:
wzxhzdk:10
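
A short sketch of the `activate` pattern on a toy graph. The node and edge tables are invented for illustration; `tbl_graph()`, `activate()` and the `dplyr` verbs are used as documented by `tidygraph`.

```r
library(tidygraph)
library(dplyr)

# Made-up node and edge tables
nodes <- data.frame(
  name = c("a", "b", "c"),
  kind = c("gene", "gene", "protein")
)
edges <- data.frame(
  from = c("a", "a", "b"),
  to = c("b", "c", "c"),
  weight = c(0.9, 0.4, 0.7)
)

graph <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

# Activate the edge table, then filter it: only edges are affected
strong <- graph %>%
  activate(edges) %>%
  filter(weight > 0.5)

# Switch to the node table and add a column there
strong <- strong %>%
  activate(nodes) %>%
  mutate(label = toupper(name))

# Extract each component as an ordinary tibble
edge_tbl <- strong %>% activate(edges) %>% as_tibble()
node_tbl <- strong %>% activate(nodes) %>% as_tibble()
```

Each verb only touches whichever table is currently active, so the same pipeline can alternate between nodes and edges without leaving the `tbl_graph`.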
# `polyply` and multiple, linked data-frames
## `polyply` {.build}
Aim:
- multiple data-frames in one data-structure
- $\rightarrow$ class `poly_frame`: extends `list`
- `poly_frame`: [list[data-frame], merge_fn]
- mutation / filtering
- merging
## Exported functions
- `as_poly_frame`
- convert a data-structure into a `poly_frame`
- `activate`
- choose a data-frame from within the `poly_frame`
- `filter`
- modify the contents of the active data-frame
- `merge`
- user-defined data-frame combiner (default: `reduce(inner_join)(df_list)`, i.e. inner-join the data-frames one after another)
- others to be added (mutate / select etc)
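
To make the idea concrete, here is a toy sketch of the pattern in plain R with `dplyr`. It is not `polyply`'s implementation or API: the `toy_*` helpers, their signatures and the example data are all hypothetical, and only mirror the roles listed above (construct, activate, filter, merge).

```r
library(dplyr)

# Hypothetical helpers illustrating the pattern; NOT polyply's actual API.

# Bundle a named list of data-frames with a merge function
toy_poly_frame <- function(df_list,
                           merge_fn = function(dfs) Reduce(inner_join, dfs)) {
  structure(
    list(data = df_list, merge_fn = merge_fn, active = names(df_list)[1]),
    class = "toy_poly_frame"
  )
}

# Choose which data-frame subsequent verbs act on
toy_activate <- function(x, name) {
  stopifnot(name %in% names(x$data))
  x$active <- name
  x
}

# Filter the rows of the active data-frame only
toy_filter <- function(x, ...) {
  x$data[[x$active]] <- filter(x$data[[x$active]], ...)
  x
}

# Combine all data-frames using the stored merge function
toy_merge <- function(x) x$merge_fn(x$data)

# Made-up example data sharing the key `sample_id`
samples <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  treatment = c("control", "drug", "drug")
)
expression <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  value = c(5.1, 4.8, 5.0)
)

pf <- toy_poly_frame(list(samples = samples, expression = expression))
pf <- toy_activate(pf, "samples")
pf <- toy_filter(pf, treatment == "drug")
merged <- toy_merge(pf)
```

Filtering touches only the active `samples` table; the inner join in the merge step then drops the unmatched rows of `expression`, so the filter propagates without any manual bookkeeping.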
# Examples
## ExpressionSet Example
wzxhzdk:11
wzxhzdk:12
wzxhzdk:13
## Construct a poly-frame from an ExpressionSet
wzxhzdk:14
## What did we just make?
wzxhzdk:15
## Filter and plot:
wzxhzdk:16
## Filter and plot (cont.)
wzxhzdk:17
## Taxonomy and brains
wzxhzdk:18
## Taxonomies (cont.)
wzxhzdk:19
## Taxonomies (cont.)
wzxhzdk:20
## Taxonomies & brains (cont.)
wzxhzdk:21
# Thanks