pruneKnn: Function inferring a pruned knn matrix
In RaceID: Identification of Cell Types, Inference of Lineage Trees, and Prediction of Noise Dynamics from Single-Cell RNA-Seq Data

pruneKnn

R Documentation

Function inferring a pruned knn matrix

Description

This function determines k nearest neighbours for each cell in gene expression space, and tests if the links are supported by a negative binomial joint distribution of gene expression. A probability is assigned to each link which is given by the minimum joint probability across all genes.

Usage

pruneKnn(
  expData,
  distM = NULL,
  large = TRUE,
  regNB = TRUE,
  bmethod = NULL,
  batch = NULL,
  regVar = NULL,
  offsetModel = TRUE,
  thetaML = FALSE,
  theta = 10,
  ngenes = 2000,
  span = 0.75,
  pcaComp = NULL,
  tol = 1e-05,
  algorithm = "kd_tree",
  metric = "pearson",
  genes = NULL,
  knn = 25,
  do.prune = TRUE,
  alpha = 1,
  nb = 3,
  no_cores = NULL,
  FSelect = FALSE,
  pca.scale = FALSE,
  ps = 1,
  seed = 12345,
  theta.harmony = NULL,
  ...
)

Arguments

`expData`	Matrix of gene expression values with genes as rows and cells as columns. These values have to correspond to unique molecular identifier counts. Alternatively, a Seurat object can be used as input, after normalization, PCA-based dimensionality reduction, and shared-nearest neighbour inference.
`distM`	Optional distance matrix used for determining k nearest neighbours. Default is `NULL` and the distance matrix is computed using a metric given by the parameter `metric`.
`large`	logical. If `TRUE` then no distance matrix is required and nearest neighbours are inferred by the FNN package based on a reduced feature matrix computed by a principal component analysis. Only the first `pcaComp` principal components are considered. Prior to principal component analysis a negative binomial regression is performed to eliminate the dependence on the total number of transcripts per cell. The Pearson residuals of this regression serve as input for the principal component analysis after smoothing the parameter dependence on the mean by a `loess` regression. Default is `TRUE`. Recommended mode for very large datasets, where storing a distance matrix requires too much memory. `distM` will be ignored if `large` is `TRUE`.
`regNB`	logical. If `TRUE` then a negative binomial regression is performed prior to the principal component analysis if `large = TRUE`. See `large`. Otherwise, transcript counts in each cell are normalized to one, multiplied by the minimal total transcript count across all cells, followed by adding a pseudocount of 0.1 and taking the logarithm. Default is `TRUE`.
`bmethod`	Character string indicating the batch correction method. If "harmony", then batch correction is performed by the harmony package. Default is `NULL` and batch correction will be done by negative binomial regression.
`batch`	Vector of batch variables. Component names need to correspond to valid cell IDs, i.e. column names of `expData`. If `regNB` is `TRUE`, then the batch variable will be regressed out simultaneously with the log UMI count per cell. An interaction term is included for the log UMI count with the batch variable. Default value is `NULL`.
`regVar`	data.frame with additional variables to be regressed out simultaneously with the log UMI count and the batch variable (if `batch` is not `NULL`). Column names indicate variable names (name `beta` is reserved for the coefficient of the log UMI count), and rownames need to correspond to valid cell IDs, i.e. column names of `expData`. Interaction terms are included for each variable in `regVar` with the batch variable (if `batch` is not `NULL`). Default value is `NULL`.
`offsetModel`	Logical parameter. Only considered if `regNB` is `TRUE`. If `TRUE` then the `beta` (log UMI count) coefficient is set to 1 and the intercept is computed analytically as the log ratio of UMI counts for a gene and the total UMI count across all cells. Batch variables and additional variables in `regVar` are regressed out with an offset term given by the sum of the intercept and the log UMI count. Default is `TRUE`.
`thetaML`	Logical parameter. Only considered if `offsetModel` equals `TRUE`. If `TRUE` then the dispersion parameter is estimated by a maximum likelihood fit. Otherwise, it is set to `theta`. Default is `FALSE`.
`theta`	Positive real number. Fixed value of the dispersion parameter. Only considered if `thetaML` equals `FALSE`. Default is 10.
`ngenes`	Positive integer number. Randomly sampled number of genes (from rownames of `expData`) used for predicting regression coefficients (if `regNB=TRUE`). Smoothed coefficients are derived for all genes. Default is 2000.
`span`	Positive real number. Parameter for loess-regression (see `large`) controlling the degree of smoothing. Default is 0.75.
`pcaComp`	Positive integer number. Number of principal components to be included if `large` is `TRUE`. Default is `NULL` and the number of principal components used for dimensionality reduction of the feature matrix is derived by an elbow criterion. However, the minimum number of components will be set to 15 if the elbow criterion results in a smaller number. The derived number can be plotted using the `plotPC` function.
`tol`	Numerical value greater than zero. Tolerance for numerical PCA using irlba. Default value is 1e-5.
`algorithm`	Algorithm for fast k nearest neighbour inference, using the `get.knn` function from the FNN package. See `help(get.knn)`. Default is "kd_tree".
`metric`	Distances are computed from the expression matrix `x` after optionally including only genes given as argument `genes` or after optional feature selection (see `FSelect`). Possible values for `metric` are `"pearson", "spearman", "logpearson", "euclidean"`. Default is `"pearson"`. In case of the correlation based methods, the distance is computed as 1 – correlation. This parameter is only used if `large` is FALSE and `distM` is NULL.
`genes`	Vector of gene names corresponding to a subset of rownames of `x`. Only these genes are used for the computation of a distance matrix and for the computation of joint probabilities of nearest neighbours. Default is `NULL` and all genes are used.
`knn`	Positive integer number. Number of nearest neighbours considered for each cell. Default is 25.
`do.prune`	Logical parameter. If `TRUE`, then pruning of k-nearest neighbourhoods is performed. If `FALSE`, then no pruning is done. Default is `TRUE`.
`alpha`	Positive real number. Relative weight of a cell versus its k nearest neighbours applied for the derivation of joint probabilities. A cell receives a weight of `alpha` while the weights of its k nearest neighbours, as determined by quadratic programming, sum up to one. The sum across all weights and `alpha` is normalized to one, and the weighted mean expression is used for computing the link probabilities for each of the k nearest neighbours. Larger values give more weight to the gene expression observed in a cell versus its neighbourhood. Typical values should be in the range of 0 to 10. Default value is 1. If `alpha` is set to `NULL` it is inferred by an optimization, i.e., `alpha` is minimized under the constraint that the gene expression in a cell does not deviate more than one standard deviation from the predicted weighted mean, where the standard deviation is calculated from the predicted mean using the background model (the average dependence of the variance on the mean expression). This procedure is computationally more intensive and increases the run time of the function significantly.
`nb`	Positive integer number. Number of genes with the lowest outlier probability included for calculating the link probabilities for the knn pruning. The link probability is computed as the geometric mean across these genes. Default is 3.
`no_cores`	Positive integer number. Number of cores for multithreading. If set to `NULL` then the number of available cores minus two is used. Default is `NULL`.
`FSelect`	Logical parameter. If `TRUE`, then feature selection is performed prior to distance matrix calculation and VarID analysis. Default is `FALSE`.
`pca.scale`	Logical parameter. If `TRUE`, then input features are scaled prior to PCA transformation. Default is `FALSE`.
`ps`	Real number greater than or equal to zero. Pseudocount to be added to counts within local neighbourhoods for outlier identification and pruning. Default is 1.
`seed`	Integer number. Seed used to initialize stochastic routines. Default is 12345.
`theta.harmony`	`theta` parameter of `RunHarmony` function from the harmony package (to avoid collision with the dispersion parameter `theta`). Default is NULL.
`...`	Additional parameters for `RunHarmony` function from the harmony package, if `batch` is not `NULL` and `bmethod="harmony"`.
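
The fallback normalization described under `regNB` and the correlation-based distance described under `metric` can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper names `normalize_counts`, `cor_dist`, and `knn_from_dist`, the toy count matrix, the use of the natural logarithm, and the choice to compute distances on the normalized values are all assumptions made for illustration.

```r
# Toy UMI count matrix (hypothetical data): genes as rows, cells as columns.
set.seed(1)
counts <- matrix(rpois(6 * 5, lambda = 5), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("cell", 1:5)))

# Normalization as described for regNB = FALSE: scale each cell to a total of
# one, multiply by the minimal total count across all cells, add a pseudocount
# of 0.1, and take the logarithm (natural log is an assumption here).
normalize_counts <- function(x, pseudocount = 0.1) {
  totals <- colSums(x)
  scaled <- sweep(x, 2, totals, "/") * min(totals)
  log(scaled + pseudocount)
}

# Correlation-based distance between cells (columns): 1 - correlation.
cor_dist <- function(x, method = "pearson") {
  1 - cor(x, method = method)
}

# For each cell, the indices of the k + 1 closest cells by distance
# (normally the cell itself comes first). One column per cell.
knn_from_dist <- function(d, k) {
  apply(d, 2, function(v) order(v)[seq_len(k + 1)])
}

norm <- normalize_counts(counts)
d <- cor_dist(norm)
nn <- knn_from_dist(d, k = 2)
```

Here `nn` has `k + 1` rows and one column per cell, which mirrors the layout documented for the `NN` component of the return value.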

Value

List object with the following components:

`distM`	Distance matrix.
`dimRed`	PCA transformation of `expData` including the first `pcaComp` principal components, computed on `genes` only, or on variable genes only if `FSelect` equals `TRUE`. It is set to `NULL` if `large` equals `FALSE`.
`pvM`	Matrix of link probabilities between a cell and each of its k nearest neighbours (Bonferroni-corrected p-values). Column `i` shows the k nearest neighbour link probabilities for cell `i` in matrix `x`.
`pvM.raw`	Matrix of uncorrected link probabilities between a cell and each of its k nearest neighbours (without multiple-testing correction). Column `i` shows the k nearest neighbour link probabilities for cell `i` in matrix `x`.
`NN`	Matrix of column indices of k nearest neighbours for each cell according to input matrix `x`. First entry corresponds to index of the cell itself. Columns contain the k nearest neighbour indices for cell `i` in matrix `x`.
`B`	List object with background model of gene expression as obtained by `fitBackVar` function.
`regData`	If `regNB=TRUE` this component contains a list of four components: component `pearsonRes` contains a matrix of the Pearson residuals computed from the negative binomial regression, component `nbRegr` contains a matrix with the regression coefficients, component `nbRegrSmooth` contains a matrix with the smoothed regression coefficients, and `log_umi` is a vector with the total log UMI count for each cell. The regression coefficients comprise the dispersion parameter theta, the intercept, the regression coefficient beta for the log UMI count, and the regression coefficients of the batches (if `batch` is not `NULL`).
`alpha`	Vector of inferred values for the `alpha` parameter for each neighbourhood (if input parameter `alpha` is NULL; otherwise all values are equal to the input parameter).
`pars`	List object storing the run parameters.
`pca`	Principal component analysis of the input data, if `large` is TRUE. Output of the function `irlba` from the irlba package with `pcaComp` principal components, or 100 principal components if `pcaComp` is NULL.
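
To make the relationship between `nb`, `pvM.raw`, `pvM`, and `NN` more concrete, the following base R sketch computes a link probability as the geometric mean of the `nb` lowest per-gene probabilities, applies a Bonferroni correction, and reads neighbours from a matrix laid out like `NN`. The per-gene probabilities, the helper names `link_prob` and `bonferroni`, the number of tests, and the toy `NN` matrix are hypothetical; the number of tests the package actually corrects for is not stated here.

```r
# Hypothetical per-gene probabilities for a single link (cell to neighbour).
gene_p <- c(geneA = 0.20, geneB = 0.001, geneC = 0.50, geneD = 0.04, geneE = 0.01)

# Geometric mean across the nb genes with the lowest probability.
link_prob <- function(p, nb = 3) {
  lowest <- sort(p)[seq_len(min(nb, length(p)))]
  exp(mean(log(lowest)))
}

# Bonferroni correction: multiply by the number of tests and cap at one.
# n_tests is an assumption of this sketch, not a documented quantity.
bonferroni <- function(p, n_tests) pmin(1, p * n_tests)

raw <- link_prob(gene_p, nb = 3)       # analogous to an entry of pvM.raw
adj <- bonferroni(raw, n_tests = 10)   # analogous to an entry of pvM

# Toy matrix with the documented NN layout: column i belongs to cell i and
# the first entry of each column is the index of the cell itself.
NN <- cbind(c(1, 2, 3), c(2, 1, 3), c(3, 2, 1))
neighbours_of_cell2 <- NN[-1, 2]       # drop the cell itself
```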

Examples

res <- pruneKnn(intestinalDataSmall,knn=10,alpha=1,no_cores=1,FSelect=FALSE)
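
As a possible follow-up to the example above, the returned list can be inspected with base R functions. The snippet is guarded with `requireNamespace` so that it does nothing when RaceID is not installed; it relies only on the component names listed under Value.

```r
# Run only if the RaceID package is available.
if (requireNamespace("RaceID", quietly = TRUE)) {
  library(RaceID)
  res <- pruneKnn(intestinalDataSmall, knn = 10, alpha = 1,
                  no_cores = 1, FSelect = FALSE)
  print(names(res))     # components of the returned list
  print(dim(res$NN))    # neighbour index matrix, one column per cell
  print(res$NN[, 1])    # indices for the first cell (the cell itself comes first)
  print(res$pvM[, 1])   # corrected link probabilities for the first cell
}
```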

RaceID documentation built on April 4, 2025, 4:34 a.m.
