USCbiostats/partition: Agglomerative partitioning framework for dimension reduction of high-dimensional genomic datasets

Dependencies among variables are a common feature of genomic data types, including genome, epigenome, transcriptome, microbiome, and metabolome data. Improvements in genomic technologies, accompanied by decreasing costs, have vastly increased the amount of information collected from individual tissue samples. However, this increase in information is often accompanied by increasing dependencies among variables. This dynamic has fueled the need for methods that reduce the dimensionality of datasets by summarizing multiple dependent variables into fewer, less dependent variables. Dimension reduction has multiple benefits, including reduced computational demands, a reduced multiple-testing burden, better-behaved data, and a possible increase in statistical power to detect associations with external variables.

The algorithms included here use an agglomerative partitioning framework and share the following goals: 1) minimal information loss given the achieved reduction in dimensionality, 2) a mapping of each original variable to one and only one variable in the reduced dataset, and 3) adherence to a user-specified maximum amount of information loss. The framework can be described as a partitioning of the original features into subsets of similar variables, with a function applied to each subset to summarize it into a single new variable. Each subset, together with its new variable, must satisfy the maximum information loss criterion, and the overall goal is to minimize the number of subsets subject to that criterion.
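The framework described above can be illustrated with a small, language-neutral sketch. This is not the package's API or its actual algorithm: the function names, the use of the per-sample mean as the summary function, and the information measure (the smallest squared correlation between a member variable and its subset's summary) are all assumptions chosen for illustration. The sketch greedily merges the pair of subsets that retains the most information and stops once the best available merge would fall below the user-specified threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def summarize(columns):
    """Assumed summary function: the per-sample mean of the subset's variables."""
    return [sum(vals) / len(vals) for vals in zip(*columns)]


def information(columns):
    """Assumed information measure: the smallest squared correlation
    between any member variable and the subset's summary."""
    if len(columns) == 1:
        return 1.0
    summary = summarize(columns)
    return min(pearson(col, summary) ** 2 for col in columns)


def agglomerative_partition(data, threshold):
    """Greedily merge subsets while the merged subset keeps information >= threshold.

    data: dict mapping variable name -> list of values (one per sample).
    Returns a list of subsets (lists of names); each variable is in exactly one.
    """
    subsets = [[name] for name in data]
    while len(subsets) > 1:
        # Score every candidate merge and pick the one retaining the most information.
        best_score, best_pair = max(
            (information([data[v] for v in a + b]), (i, j))
            for (i, a), (j, b) in combinations(enumerate(subsets), 2)
        )
        if best_score < threshold:
            break  # the best remaining merge would lose too much information
        i, j = best_pair
        merged = subsets[i] + subsets[j]
        subsets = [s for k, s in enumerate(subsets) if k not in (i, j)] + [merged]
    return subsets


# Toy data: "a"/"b" and "c"/"d" are near-duplicates of each other.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "c": [2, 7, 1, 8, 2, 8],
    "d": [2.2, 6.9, 1.1, 7.8, 2.1, 8.1],
}
result = agglomerative_partition(data, threshold=0.9)
reduced = {"+".join(s): summarize([data[v] for v in s]) for s in result}
print(sorted(sorted(s) for s in result))
```

With a threshold of 0.9 (that is, at most 10% information loss under the assumed measure), the two near-duplicate pairs are grouped and the four variables reduce to two. Stopping at the first failed merge keeps the sketch short; a fuller implementation could go on to try other candidate pairs.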

Package details

Package repository: GitHub (USCbiostats/partition)
Installation: Install the latest version of this package from GitHub in R, for example (assuming the remotes package is installed) with remotes::install_github("USCbiostats/partition").
USCbiostats/partition documentation built on July 8, 2018, 8:16 a.m.