dmFilter: Filtering

Description Usage Arguments Details Value Author(s) See Also Examples

Description

Filtering of genes and features with low expression. Additionally, for the dmSQTLdata object, filtering of genotypes with low frequency.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
dmFilter(x, ...)

## S4 method for signature 'dmDSdata'
dmFilter(x, min_samps_gene_expr = 0,
  min_samps_feature_expr = 0, min_samps_feature_prop = 0,
  min_gene_expr = 0, min_feature_expr = 0, min_feature_prop = 0,
  run_gene_twice = FALSE)

## S4 method for signature 'dmSQTLdata'
dmFilter(x, min_samps_gene_expr = 0,
  min_samps_feature_expr = 0, min_samps_feature_prop = 0,
  minor_allele_freq = 0.05 * nrow(samples(x)), min_gene_expr = 0,
  min_feature_expr = 0, min_feature_prop = 0,
  BPPARAM = BiocParallel::SerialParam())

Arguments

x

dmDSdata or dmSQTLdata object.

...

Other parameters that can be defined by methods using this generic.

min_samps_gene_expr

Minimal number of samples where genes should be expressed. See Details.

min_samps_feature_expr

Minimal number of samples where features should be expressed. See Details.

min_samps_feature_prop

Minimal number of samples where features should be expressed. See details.

min_gene_expr

Minimal gene expression.

min_feature_expr

Minimal feature expression.

min_feature_prop

Minimal proportion for feature expression. This value should be between 0 and 1.

run_gene_twice

Whether to re-run the gene-level filter after the feature-level filters.

minor_allele_freq

Minimal number of samples where each of the genotypes has to be present.

BPPARAM

Parallelization method used by bplapply.

Details

Filtering parameters should be adjusted according to the sample size of the experiment data and the number of replicates per condition.

min_samps_gene_expr defines the minimal number of samples where genes are required to be expressed at the minimal level of min_gene_expr in order to be included in the downstream analysis. Ideally, we would like that genes were expressed at some minimal level in all samples because this would lead to better estimates of feature ratios.

Similarly, min_samps_feature_expr and min_samps_feature_prop defines the minimal number of samples where features are required to be expressed at the minimal levels of counts min_feature_expr or proportions min_feature_prop. In differential transcript/exon usage analysis, we suggest using min_samps_feature_expr and min_samps_feature_prop equal to the minimal number of replicates in any of the conditions. For example, in an assay with 3 versus 5 replicates, we would set these parameters to 3, which allows a situation where a feature is expressed in one condition but may not be expressed at all in another one, which is an example of differential transcript/exon usage.

By default, all the filtering parameters equal zero which means that features with zero expression in all samples are removed as well as genes with only one non-zero feature.

In QTL analysis, usually, we deal with data that has many more replicates than data from a standard differential usage assay. Our example data set consists of 91 samples. Requiring that genes are expressed in all samples may be too stringent, especially since there may be missing values in the data and for some genes you may not observe counts in all 91 samples. Slightly lower threshold ensures that we do not eliminate such genes. For example, if min_samps_gene_expr = 70 and min_gene_expr = 10, only genes with expression of at least 10 in at least 70 samples are kept. Samples with expression lower than 10 have NAs assigned and are skipped in the analysis of this gene. minor_allele_freq indicates the minimal number of samples for the minor allele presence. Usually, it is equal to roughly 5% of total samples.

Value

Returns filtered dmDSdata or dmSQTLdata object.

Author(s)

Malgorzata Nowicka

See Also

plotData

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
# --------------------------------------------------------------------------
# Create dmDSdata object 
# --------------------------------------------------------------------------
## Get kallisto transcript counts from the 'PasillaTranscriptExpr' package

library(PasillaTranscriptExpr)

data_dir  <- system.file("extdata", package = "PasillaTranscriptExpr")

## Load metadata
pasilla_metadata <- read.table(file.path(data_dir, "metadata.txt"), 
header = TRUE, as.is = TRUE)

## Load counts
pasilla_counts <- read.table(file.path(data_dir, "counts.txt"), 
header = TRUE, as.is = TRUE)

## Create a pasilla_samples data frame
pasilla_samples <- data.frame(sample_id = pasilla_metadata$SampleName, 
  group = pasilla_metadata$condition)
levels(pasilla_samples$group)

## Create a dmDSdata object
d <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples)

## Use a subset of genes, which is defined in the following file
gene_id_subset <- readLines(file.path(data_dir, "gene_id_subset.txt"))

d <- d[names(d) %in% gene_id_subset, ]

# --------------------------------------------------------------------------
# Differential transcript usage analysis - simple two group comparison 
# --------------------------------------------------------------------------

## Filtering
## Check what is the minimal number of replicates per condition 
table(samples(d)$group)

d <- dmFilter(d, min_samps_gene_expr = 7, min_samps_feature_expr = 3,
  min_gene_expr = 10, min_feature_expr = 10)

plotData(d)

# --------------------------------------------------------------------------
# Create dmSQTLdata object
# --------------------------------------------------------------------------
# Use subsets of data defined in the GeuvadisTranscriptExpr package

library(GeuvadisTranscriptExpr)

geuv_counts <- GeuvadisTranscriptExpr::counts
geuv_genotypes <- GeuvadisTranscriptExpr::genotypes
geuv_gene_ranges <- GeuvadisTranscriptExpr::gene_ranges
geuv_snp_ranges <- GeuvadisTranscriptExpr::snp_ranges

colnames(geuv_counts)[c(1,2)] <- c("feature_id", "gene_id")
colnames(geuv_genotypes)[4] <- "snp_id"
geuv_samples <- data.frame(sample_id = colnames(geuv_counts)[-c(1,2)])

d <- dmSQTLdata(counts = geuv_counts, gene_ranges = geuv_gene_ranges,  
  genotypes = geuv_genotypes, snp_ranges = geuv_snp_ranges, 
  samples = geuv_samples, window = 5e3)

# --------------------------------------------------------------------------
# sQTL analysis - simple group comparison
# --------------------------------------------------------------------------

## Filtering
d <- dmFilter(d, min_samps_gene_expr = 70, min_samps_feature_expr = 5,
  minor_allele_freq = 5, min_gene_expr = 10, min_feature_expr = 10)
  
plotData(d)

gosianow/DRIMSeq documentation built on Aug. 8, 2020, 10:29 a.m.