isoDeconvMM: Cell Type Deconvolution using RNA Isoform-Level Expression


Description

Calculates the proportions of pure cell type components in heterogeneous cell type samples of RNA-seq data, utilizing isoform-level expression differences.

Usage

IsoDeconvMM(directory = NULL, mix_files, pure_ref_files, fraglens_files,
  bedFile, knownIsoforms, discrim_genes, readLen, lmax = 600,
  eLenMin = 1, mix_names = NULL, initPts = NULL,
  optim_options = optimControl())

Arguments

directory

an optional character string denoting the path to the directory where all of the mix_files, pure_ref_files, fraglens_files, and bedFile are located. The working directory is set to this directory. If this argument is left as NULL, then all of the relevant files must either (a) be located in the current working directory or (b) have their full path specified.

mix_files

a vector of the file names for the text files recording the number of RNA-seq fragments per exon set. Each file should have 2 columns, "count" and "exons", without a header. For example:

  37 chr18_109|ENSMUSG00000024491|4;chr18_109|ENSMUSG00000024491|5;
  17 chr18_109|ENSMUSG00000024491|5;
  88 chr18_109|ENSMUSG00000024491|5;chr18_109|ENSMUSG00000024491|6;

There should be one file for each of the samples containing mixtures of cells. The second column lists exon sets, where "chr18_109" indicates a transcript cluster, "ENSMUSG00000024491" is the Ensembl gene ID, and the number at the end is the exon ID. Directions to create these count files can be found in the Step_0_Processes directory of the GitHub repo hheiling/deconvolution <https://github.com/hheiling/deconvolution>.
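To make the count file layout concrete, the following is a minimal sketch that parses the three example rows quoted in this description using base R only. The helper name parse_exon is hypothetical and is not part of the package; the sketch assumes the two columns are separated by whitespace.

```r
# Example rows copied from the mix_files description.
count_lines <- c(
  "37 chr18_109|ENSMUSG00000024491|4;chr18_109|ENSMUSG00000024491|5;",
  "17 chr18_109|ENSMUSG00000024491|5;",
  "88 chr18_109|ENSMUSG00000024491|5;chr18_109|ENSMUSG00000024491|6;"
)

# Two whitespace-separated columns, no header (assumed separator).
counts <- read.table(text = count_lines, header = FALSE,
                     col.names = c("count", "exons"),
                     stringsAsFactors = FALSE)

# Each exon set is a ";"-terminated list of cluster|gene|exon entries.
exon_sets <- strsplit(counts$exons, ";", fixed = TRUE)

# Hypothetical helper: split one entry into its three fields.
parse_exon <- function(x) {
  parts <- strsplit(x, "|", fixed = TRUE)[[1]]
  data.frame(cluster = parts[1], gene = parts[2],
             exon = as.integer(parts[3]), stringsAsFactors = FALSE)
}

# Fields of the first exon set (exons 4 and 5 of the same gene).
first_set <- do.call(rbind, lapply(exon_sets[[1]], parse_exon))
print(first_set)
```

A file on disk could be read the same way by replacing the text argument with the file name.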

pure_ref_files

a matrix with two columns. The first column contains the file names for the text files recording the number of RNA-seq fragments per exon set (see 'mix_files' for additional description), one for each of the pure reference cell type samples. The second column contains the character names of the pure cell type associated with each sample. See the Step_0_Processes directory in <https://github.com/hheiling/deconvolution> for directions on how to create these files.
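As an illustration of the two-column layout, here is a small sketch that builds such a matrix. All file names and cell type labels below are hypothetical placeholders, not files shipped with the package.

```r
# Hypothetical file names and cell type labels, for illustration only.
pure_ref_files <- cbind(
  c("cellA_rep1_counts.txt", "cellA_rep2_counts.txt",
    "cellB_rep1_counts.txt", "cellB_rep2_counts.txt"),
  c("cellTypeA", "cellTypeA", "cellTypeB", "cellTypeB")
)

# Number of pure reference samples per cell type.
samples_per_type <- table(pure_ref_files[, 2])
print(samples_per_type)
```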

fraglens_files

a vector of the file names for the text files recording the distribution of the fragment lengths. Each file should have 2 columns, "Frequency" and "Length", without a header. For example:

  20546 75
  40465 76
  37486 77
  27533 78
  25344 79

Directions to create these fragment length files are also available in the Step_0_Processes directory in the GitHub repo hheiling/deconvolution, <https://github.com/hheiling/deconvolution>.
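The sketch below reads the example rows from this description and converts the frequencies to proportions. Comparing the observed lengths against lmax is only an assumed sanity check that a user might run beforehand; it is not a documented requirement of the function.

```r
# Example rows copied from the fraglens_files description.
fraglens_lines <- c("20546 75", "40465 76", "37486 77", "27533 78", "25344 79")

fraglens <- read.table(text = fraglens_lines, header = FALSE,
                       col.names = c("Frequency", "Length"))

# Empirical proportion of fragments at each length.
fraglens$Proportion <- fraglens$Frequency / sum(fraglens$Frequency)

# Assumed sanity check: observed lengths do not exceed the default lmax.
lmax <- 600
lengths_ok <- all(fraglens$Length <= lmax)
print(fraglens)
```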

bedFile

file name of the .bed file recording information of non-overlapping exons, which has 6 columns, "chr", "start", "end", "exon", "score", and "strand", without a header. For example:

  chr1 3044314 3044814 ENSMUSG00000090025:1 666 +
  chr1 3092097 3092206 ENSMUSG00000064842:1 666 +

Directions to create this .bed file can be found in the Create_BED_knownIsoforms_Files directory in the GitHub repo hheiling/deconvolution, <https://github.com/hheiling/deconvolution>.
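A minimal sketch of reading the two example rows with base R follows. It assumes whitespace-separated fields and that the "exon" column has the form geneID:exonIndex, as the example suggests; the derived column names gene and exon_id are hypothetical.

```r
# Example rows copied from the bedFile description.
bed_lines <- c(
  "chr1 3044314 3044814 ENSMUSG00000090025:1 666 +",
  "chr1 3092097 3092206 ENSMUSG00000064842:1 666 +"
)

bed <- read.table(text = bed_lines, header = FALSE,
                  col.names = c("chr", "start", "end", "exon", "score", "strand"),
                  stringsAsFactors = FALSE)

# Split "geneID:exonIndex" into two derived columns (hypothetical names).
exon_parts <- do.call(rbind, strsplit(bed$exon, ":", fixed = TRUE))
bed$gene <- exon_parts[, 1]
bed$exon_id <- as.integer(exon_parts[, 2])
print(bed)
```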

knownIsoforms

character string for the name of an .RData object that contains the known isoform information. When loaded, this object is a list where each component is a binary matrix that specifies a set of possible isoforms (e.g., isoforms from annotations). Specifically, it is a binary matrix of k rows and m columns, where k is the number of non-overlapping exons and m is the number of isoforms. isoforms[i,j]=1 indicates that the i-th exon belongs to the j-th isoform. For example, the following matrix indicates the three isoforms for one gene ENSMUSG00000000003:

       ENSMUST00000000003 ENSMUST00000166366 ENSMUST00000114041
  [1,]                  1                  1                  1
  [2,]                  1                  1                  1
  [3,]                  1                  1                  1
  [4,]                  1                  1                  0
  [5,]                  1                  1                  1
  [6,]                  1                  1                  1
  [7,]                  1                  1                  1
  [8,]                  1                  0                  0

Instructions for creating such an RData object can be found in the Create_BED_knownIsoforms_Files directory in the GitHub repo hheiling/deconvolution, <https://github.com/hheiling/deconvolution>.
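The example matrix can be rebuilt in R as sketched below, then wrapped in a list and saved to an .RData file. The object name isoAll and the list element name are assumptions made for illustration; the names the package expects are not stated in this description.

```r
# Binary exon-by-isoform matrix from the example (8 exons, 3 isoforms).
isoforms <- matrix(
  c(1, 1, 1,
    1, 1, 1,
    1, 1, 1,
    1, 1, 0,
    1, 1, 1,
    1, 1, 1,
    1, 1, 1,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("ENSMUST00000000003", "ENSMUST00000166366",
                          "ENSMUST00000114041"))
)

# Number of exons in each isoform, and exons shared by all isoforms.
exons_per_isoform <- colSums(isoforms)
shared_exons <- which(rowSums(isoforms) == ncol(isoforms))

# Hypothetical list and object names; one matrix per list component.
isoAll <- list(ENSMUSG00000000003 = isoforms)
rdata_path <- file.path(tempdir(), "knownIsoforms_example.RData")
save(isoAll, file = rdata_path)

# Round trip: load into a fresh environment and inspect.
env <- new.env()
loaded_name <- load(rdata_path, envir = env)
print(exons_per_isoform)
```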

discrim_genes

vector of genes that are suspected to have differential gene expression. This gene list could come from Cufflinks output, isoform package output, or something similar.

readLen

numeric value of the length of a read in the RNA-seq experiment

lmax

numeric value of the maximum fragment length of the experiment

eLenMin

numeric value of the minimum effective length. If the effective length of an exon or exon junction is smaller than eLenMin, i.e., if this exon is not included in the corresponding isoform, it is set to eLenMin. This accounts for possible sequencing or mapping errors.

mix_names

an optional vector of the desired nicknames of the mixture samples, given in the same order as the mix_files vector. If left as the default NULL value, the nicknames used will be the names given in mix_files minus the .txt extension.

initPts

an optional matrix of initial probability estimates for the cell composition of the mixture samples to be used in the optimization procedure. The matrix should have J columns, where J = number of pure cell types of interest. Each row corresponds to different combinations of initial probability values. The column names of the matrix must be provided and must correspond to the pure cell type names given in the second column of the pure_ref_files object (no particular ordering needed)
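As a sketch of the expected shape, the following builds an initPts matrix for J = 2 cell types with three candidate starting points. The cell type names are hypothetical and would need to match the second column of pure_ref_files. The sketch also assumes that each row should sum to 1, which this description implies but does not state outright.

```r
# Hypothetical cell type names; must match the labels in pure_ref_files.
cell_types <- c("cellTypeA", "cellTypeB")

# One row per starting point, one named column per cell type.
initPts <- matrix(
  c(0.10, 0.90,
    0.50, 0.50,
    0.90, 0.10),
  ncol = length(cell_types), byrow = TRUE,
  dimnames = list(NULL, cell_types)
)

# Assumed check: each row is a valid set of proportions.
rows_valid <- all(initPts >= 0) && all(abs(rowSums(initPts) - 1) < 1e-8)
print(initPts)
```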

optim_options

a list inheriting from class optimControl containing optimization control parameters. See the function optimControl for more details.

Value

A list object with the following structure: the first layer of the list has elements associated with each of the mixture samples; the second layer of the list has elements associated with each transcript cluster used in the analysis, determined by the genes in the discrim_genes vector. Each of these transcript cluster elements is itself a list with the following elements:

info
candiIsoform
I

Number of isoforms utilized in transcript cluster

E

Number of exons in transcript cluster

X

ExI matrix of effective lengths for each of the E exon sets within each of the I isoforms

info_status
y_mix, other y vectors for each pure cell type reference sample

Ex1 vectors of read count at each exon set for the given mixture or pure cell type sample

countN_mix, other countN values for each pure cell type reference sample
mix

a list with the elements rds_exons_t (vector of length E+1 where the last E elements are y_mix, and the first element is the total read counts for the mixture sample minus the sum of y_mix), gamma.est ((I-1)xK matrix of isoform expression parameters for each cell type k), tau.est (vector of length K of gene expression parameters in cell type k), p.est (vector of length K containing estimated proportions based on the given transcript cluster), and pm.rds.exons (ExK matrix containing posterior means for each of E exon sets in each of K cell types)

"cellType1","cellType2" ...
l_tilde

Ix1 vector of total effective lengths of each of the I isoforms; each element of the vector, denoted l_i, is a column sum from the matrix X

X.fin

edited design matrix for new gamma parameters, where the ith column of the new matrix is X.fin_i = (X_i-[l_i/l_I]X_I) for i = 1,...,(I-1) and X.fin_I = X_I/l_I

X.prime

first (I-1) columns of X.fin pertaining to gamma parameters
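The definitions of l_tilde, X.fin, and X.prime can be checked on a small example. The matrix X below is made up for illustration (E = 4 exon sets, I = 3 isoforms), and the code is a sketch of the formulas as written in these descriptions, not the package's internal implementation.

```r
# Hypothetical E x I matrix of effective lengths (4 exon sets, 3 isoforms).
X <- matrix(c(100, 100,   0,
              200, 200, 200,
               50,   0,  50,
              150, 150, 150),
            nrow = 4, byrow = TRUE)
n_iso <- ncol(X)

# l_tilde: total effective length of each isoform (column sums of X).
l_tilde <- colSums(X)

# X.fin_i = X_i - (l_i / l_I) * X_I for i = 1, ..., I-1; X.fin_I = X_I / l_I.
X.fin <- X
for (i in seq_len(n_iso - 1)) {
  X.fin[, i] <- X[, i] - (l_tilde[i] / l_tilde[n_iso]) * X[, n_iso]
}
X.fin[, n_iso] <- X[, n_iso] / l_tilde[n_iso]

# X.prime: the first I-1 columns of X.fin.
X.prime <- X.fin[, seq_len(n_iso - 1), drop = FALSE]
print(X.fin)
```

By construction, each column of X.prime sums to zero and the last column of X.fin sums to one, which follows directly from the definition of l_i as a column sum of X.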

alpha.est

IxK matrix of hyperparameters governing average isoform expression levels and variances within cells of type k

beta.est

2xK matrix of hyperparameters governing gene expression levels within cells of type k

CellType_Order

For outputs giving K different estimates for each of the K cell types, these outputs are ordered with respect to CellType_Order

WARN

An integer indicating the following information:

  0 - Optimization Complete
  1 - Iteration Limit Reached
  4 - Error in Optimization Routine (Error in mixture sample fit)
  5 - Optimization not conducted (Error in pure sample fit)
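After a run, it may be useful to collect the WARN codes across mixture samples and transcript clusters. The sketch below does this on a mock nested list that imitates the two-layer structure described under Value. The mock object, its sample and cluster names, and the helper describe_warn are all hypothetical; only the four codes and their meanings come from this page.

```r
# WARN codes and meanings as listed in this description.
warn_labels <- c(
  "0" = "Optimization Complete",
  "1" = "Iteration Limit Reached",
  "4" = "Error in Optimization Routine (Error in mixture sample fit)",
  "5" = "Optimization not conducted (Error in pure sample fit)"
)

# Hypothetical helper: map a code to its label (NA for unlisted codes).
describe_warn <- function(code) unname(warn_labels[as.character(code)])

# Mock output: mixture samples -> transcript clusters -> WARN (illustration only).
mock_result <- list(
  mix1 = list(chr18_109 = list(WARN = 0L), chr1_5 = list(WARN = 1L)),
  mix2 = list(chr18_109 = list(WARN = 0L), chr1_5 = list(WARN = 5L))
)

# Clusters in rows, mixture samples in columns.
warn_codes <- sapply(mock_result, function(smp) {
  sapply(smp, function(cluster) cluster$WARN)
})
print(warn_codes)
print(describe_warn(warn_codes["chr1_5", "mix2"]))
```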


hheiling/IsoDeconvMM documentation built on March 11, 2020, 7:28 p.m.