Filtering

Description

Filtering of genes and features with low expression. Additionally, for the dmSQTLdata object, filtering of genotypes with low frequency.

Usage

dmFilter(x, ...)

## S4 method for signature 'dmDSdata'
dmFilter(x, min_samps_gene_expr, min_samps_feature_expr,
  min_samps_feature_prop, min_gene_expr = 10, min_feature_expr = 10,
  min_feature_prop = 0, max_features = Inf)

## S4 method for signature 'dmSQTLdata'
dmFilter(x, min_samps_gene_expr, min_samps_feature_expr,
  min_samps_feature_prop, minor_allele_freq, min_gene_expr = 10,
  min_feature_expr = 10, min_feature_prop = 0, max_features = Inf,
  BPPARAM = BiocParallel::MulticoreParam(workers = 1))

Arguments

x

dmDSdata or dmSQTLdata object.

...

Other parameters that can be defined by methods using this generic.

min_samps_gene_expr

Minimal number of samples where genes should be expressed. See Details.

min_samps_feature_expr

Minimal number of samples where features should be expressed. See Details.

min_samps_feature_prop

Minimal number of samples where features should be expressed at the minimal proportion min_feature_prop. See Details.

min_gene_expr

Minimal gene expression.

min_feature_expr

Minimal feature expression.

min_feature_prop

Minimal proportion for feature expression. This value should be between 0 and 1.

max_features

Maximum number of features, which pass the filtering criteria, that should be kept for each gene. If equal to Inf, all features that pass the filtering criteria are kept.

minor_allele_freq

Minimal number of samples where each of the genotypes has to be present.

BPPARAM

Parallelization method used by bplapply.

Details

Filtering parameters should be adjusted according to the sample size of the experiment data and the number of replicates per condition.

min_samps_gene_expr defines the minimal number of samples in which a gene must be expressed at the level of at least min_gene_expr in order to be included in the downstream analysis. Ideally, genes would be expressed at some minimal level in all samples, because this leads to good estimates of feature ratios.
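The gene-level rule can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy counts, the gene_id vector, and the assumption that gene expression is the sum of a gene's feature counts are made up for illustration.

```r
# Hypothetical toy data: rows are features, columns are samples.
counts <- matrix(
  c(20, 15, 30, 0,
     5,  8, 12, 0,
     2,  1,  0, 3,
     1,  0,  4, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3", "f4"), paste0("s", 1:4))
)
gene_id <- c("geneA", "geneA", "geneB", "geneB")

min_gene_expr <- 10
min_samps_gene_expr <- 3

# Assumption: gene expression per sample is the sum of its feature counts.
gene_expr <- rowsum(counts, group = gene_id)

# Number of samples in which each gene reaches min_gene_expr.
n_samps_expressed <- rowSums(gene_expr >= min_gene_expr)

# A gene is kept when it is expressed in enough samples.
keep_gene <- n_samps_expressed >= min_samps_gene_expr
print(keep_gene)
```

Here geneA reaches the threshold in three of four samples and is kept, while geneB never reaches it and is dropped.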

Similarly, min_samps_feature_expr and min_samps_feature_prop define the minimal number of samples in which a feature must be expressed at the minimal level of counts min_feature_expr or proportions min_feature_prop, respectively. In differential splicing analysis, we suggest setting min_samps_feature_expr and min_samps_feature_prop to the minimal number of replicates in any of the conditions. For example, in an assay with 3 versus 5 replicates, we would set these parameters to 3. This allows a feature to be expressed in one condition and not expressed at all in the other, which is an example of differential splicing.

By default, filtering based on feature proportions is not used: min_feature_prop equals 0, so the proportion criterion is satisfied by every feature. The examples below likewise set min_samps_feature_prop to 0.
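The feature-level criteria can be sketched in base R for a single gene in a hypothetical 3 versus 5 design. This is an illustration only: the toy counts are invented, and combining the count and proportion criteria with a logical AND is an assumption, not a statement about how dmFilter is implemented.

```r
# Hypothetical counts for one gene: 3 features x 8 samples
# (samples 1-3 are one condition, samples 4-8 the other).
feat <- rbind(
  f1 = c(50, 60, 55, 40, 45, 50, 48, 52),
  f2 = c(12, 15, 11,  0,  0,  1,  0,  0),
  f3 = c( 1,  0,  2,  3,  0,  1,  2,  0)
)

min_feature_expr <- 10
min_samps_feature_expr <- 3   # smallest group size in a 3 vs 5 design
min_feature_prop <- 0         # proportion filter switched off
min_samps_feature_prop <- 0

# Samples in which each feature reaches the count threshold.
n_expr <- rowSums(feat >= min_feature_expr)

# Feature proportions within the gene, per sample.
prop <- sweep(feat, 2, colSums(feat), "/")
n_prop <- rowSums(prop >= min_feature_prop)

# Assumption: both criteria must hold for a feature to be kept.
keep_feature <- n_expr >= min_samps_feature_expr &
  n_prop >= min_samps_feature_prop
print(keep_feature)
```

Feature f2 is expressed only in the three samples of the first condition, yet it passes the filter because min_samps_feature_expr is 3. Feature f3 never reaches 10 counts and is removed.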

In sQTL analysis, the data usually has many more replicates than data from a standard differential splicing assay. Our example data set consists of 91 samples. Requiring that genes are expressed in all samples may be too stringent, especially since there may be missing values in the data, and for some genes counts may not be observed in all 91 samples. A slightly lower threshold ensures that such genes are not eliminated. For example, if min_samps_gene_expr = 70 and min_gene_expr = 10, only genes with expression of at least 10 in at least 70 samples are kept. Samples with expression lower than 10 have NAs assigned and are skipped in the analysis of this gene. minor_allele_freq indicates the minimal number of samples in which the minor allele must be present. Usually, it is set to 5% of the total number of samples.
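As a rough base R sketch of these sQTL settings: the rounding direction for the 5% rule, the toy expression values, and the toy genotype vector are all assumptions made for illustration, and the code does not reproduce the package internals.

```r
# 5% of the 91 samples in the example data, rounded up (assumed rounding).
n_samples <- 91
minor_allele_freq <- ceiling(0.05 * n_samples)
print(minor_allele_freq)

# Hypothetical expression of one gene in five samples.
gene_expr <- c(s1 = 25, s2 = 4, s3 = 12, s4 = 9, s5 = 30)
min_gene_expr <- 10

# Samples below the threshold get NA and are skipped for this gene.
masked <- ifelse(gene_expr < min_gene_expr, NA, gene_expr)
n_used <- sum(!is.na(masked))

# Hypothetical genotype of one SNP across the 91 samples (0, 1 or 2
# copies of the minor allele); every genotype must occur often enough.
genotype <- c(rep(0, 60), rep(1, 26), rep(2, 5))
genotype_ok <- all(table(genotype) >= minor_allele_freq)
```

With these toy values the threshold is 5 samples, three of the five samples are used for the gene, and the SNP passes because its rarest genotype occurs in exactly 5 samples.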

Value

Returns filtered dmDSdata or dmSQTLdata object.

Author(s)

Malgorzata Nowicka

See Also

data_dmDSdata, data_dmSQTLdata, plotData, dmDispersion, dmFit, dmTest

Examples

###################################
### Differential splicing analysis
###################################

d <- data_dmDSdata

### Filtering
# Check the minimal number of replicates per condition
table(samples(d)$group)
d <- dmFilter(d, min_samps_gene_expr = 7, min_samps_feature_expr = 3, 
 min_samps_feature_prop = 0)
plotData(d)

#############################
### sQTL analysis
#############################
# If possible, use BPPARAM = BiocParallel::MulticoreParam() with more workers

d <- data_dmSQTLdata

### Filtering
d <- dmFilter(d, min_samps_gene_expr = 70, min_samps_feature_expr = 5, 
 min_samps_feature_prop = 0, minor_allele_freq = 5, 
 BPPARAM = BiocParallel::SerialParam())
plotData(d)