Description Usage Arguments Details Value References Examples
View source: R/dacomp_main_function.R
The function tests taxa for differential abundance, given a set of reference taxa used for normalization, as described in Brill et. al. (2019). The function supports several tests, by type of phenotype: two sample tests, K-sample tests, tests for association with a continuous phenotype and user defined tests. See 'details' below on how the input argument y
shold be formatted for different tests.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
X |
Matrix of counts, with rows representing samples, columns representing taxa. See 'details' for additional information on how to format this matrix for paired study designs. |
y |
Vector of phenotype values by sample. See details on different formats used by different tests. |
ind_reference_taxa |
One of two options: Object of type 'dacomp.reference.selection.object' returned from |
test |
One of the values in the vector |
q |
The required FDR level for the DS-FDR algorithm, see Jiang et. al. (2017) for details. |
nr_perm |
Number of permuations used for testing and computing P-values. The default number of permutations is set high enough to provide powerful inference after adjusting for multiplicity. Change the number of permutations only if you know how it effects power after correcting for multiplicity. |
disable_DSFDR |
Can be used to disable the DS-FDR computation (which may take up to a minute on large datasets). The default value is |
user_defined_test_function |
Argument for inputing a function for a custom test of association supplied by the user, see 'details' below. |
compute_ratio_normalization |
Argument determines if in addition to the test described above, the function should a run test based on normalization by division rather than rarefaction. See description under details. |
verbose |
Should messages be printed to console, indicating computation progress. The default value is |
DSFDR_Filter |
A logical (boolean) vector specifying for which taxa should the DSFDR adjusted P-values be computed. Taxa whose corresponding entries in this vector are set to are excluded from testing, and will not have DSFDR adjusted Pvalues computed. If a taxon is set to be in the reference set, and its corresponding entry is set to |
Test_All |
A logical value specifying if all test (including the reference taxa) should be tested for differential abundance. When testing a reference taxon for differential abundance, it is excluded from the reference set. |
return_rarefied_values |
a logical value indicating if the rarefied draws used for testing should be returned as a matrix under the entry |
The function tests each taxon not in the reference set for differential abundance as follows. For each taxon, the following procedure is performed. First, an identical number of reads is taken from each sample, from the reads available under the reference set of taxa, and the taxon being tested. Next, a test of association is performed between the rarefied reads and the phenotype given by y
. P-values are computed by permutations. Finally, after all P-values are computed, the DS-FDR threshold for rejection is computed. Hypotheses with P-values lower than the DS-FDR rejection threshold are rejected. Reference taxa are not tested for differential abundance. See vignette('dacomp_main_vignette')
for additional details and examples.
The function supports several tests. Different tests require different formats for the arguments X
and y
.
Two Sample Tests (treatment vs. control, healthy vs.sick):
DACOMP.TEST.NAME.WILCOXON - (Preferred option for 2-sample testing) A Wilcoxon rank sum test between the rarefied reads from the two sample groups. y
is any vector with two levels. For two sample testing this is the preferred test.
DACOMP.TEST.NAME.DIFFERENCE_IN_MEANS - A two sample test for difference in means, based on permutation. y
formated as for the Wilcoxon rank sum test.
DACOMP.TEST.NAME.LOG_FOLD_DIFFERENCE_IN_MEANS - A two sample test for difference in means of the logatihm of counts, based on permutation. A pseudocount of 1 is added to the rarefied reads, so that the lograithm can be computed. y
formated as for the Wilcoxon rank sum test.
DACOMP.TEST.NAME.TWO_PART_WILCOXON - The two part test of Wagner et. al. (2011), for equality of distributions between two sample groups. y
formated as for the Wilcoxon rank sum test.
DACOMP.TEST.NAME.WELCH - Welch t.test over the rarefied counts. P-values are computed by permutations. y
formated as for the Wilcoxon rank sum test.
DACOMP.TEST.NAME.WELCH_LOGSCALE - Welch t.test over the logarithm of rarefied counts (after a pseudocount of 1 is added). P-values are computed by permutations. y
formated as for the Wilcoxon rank sum test.
Paired Design Tests:
DACOMP.TEST.NAME.WILCOXON_SIGNED_RANK_TEST - The Wilcoxon sign rank test for paired designs. For this test, X
is formatted with the first n/2 rows corresponding to the samples measured at condition 1, and the latter n/2 rows measured at condition 2, i.e. rows 1 and n/2 + 1 correspond to the same physical sample, under two conditions. For this test, y
is set to NULL
.
Tests of Association for Ordinal/Continuous Phenotypes:
DACOMP.TEST.NAME.SPEARMAN - Test of association with a continuous, univariate phenotype. y
is the measured phenotype across samples. The test performed is a permutation based test for the Spearman correlation coefficient with a two sided alternative.
K-Sample Tests:
DACOMP.TEST.NAME.KRUSKAL_WALLIS - Test for equality of distributions between K sample groups. y
should be contian the group labeling of different observations.
User Defined Tests:
DACOMP.TEST.NAME.USER_DEFINED - Indicates that a custom test is supplied using the argument user_defined_test_function
. The supplied function will receive a single argument, the vector of rarefied counts and will return an array of length nr_perm +1
, containing the test statistic computed for the original data, along with test statistics computed for permuted phenotypes. Test statistics must have a right sided alternative. A complete example with code snippets is found in vignette('dacomp_main_vignette')
.
The parameter compute_ratio_normalization
computes in addition to the test described above, a test of association between the phenotype y
and the proportion of a tested taxon out of a subvector containing only the tested taxon and a set of reference taxa. This procedure, described as DACOMP-ratio in Brill et al. (2019), may enjoy higher power to detect differentially abundant taxa with a low number of counts, due to avoiding rarefaction. However, for scenarios with an extreme change in the microbial load of the measured ecology, e.g., a four-fold change in the microbial load between different study groups, this variant of DACOMP may have a slightly inflated FDR. See method description and simulation results in the paper for additional details.
We suggest using this function toghether with the function dacomp.validate_references
. The function dacomp.validate_references
is meant to detect if differentially abundant taxa have entered the reference set, and if so, reselect the reference set. We suggest that results be reported twice: once using the original reference selected, and once using the re-selected reference set. If results agree, than it is more likely that the initial reference set selected did not feature a contamination. If results show disagreement, this could be due to a signal entering the reference set.
An object of type "dacomp.result.object", which is a list with the follow fields:
lambda - The subsampling depth for each tested taxon, as in step I under 'details'.
stats_matrix - A matrix of size (nr_perm+1) X ncol(X)
containing the tests statistics for the data (first row) and test statistics computed for permuted values of y
(other rows).
p.values.test - A vector with P-values for the different tests of association, by taxa. P-values obtained by permutations. P-values for reference taxa will appear as NA
.
p.values.test.adjusted - A vector of P-values, adjusted for multiplicity, corresponding to p.values.test
. By default, correction is done by the DS-FDR method. If disable_DSFDR
is set to TRUE
, the BH correction is performed. Entries lower than a value of u indicate a taxon declared differentially abundant when trying to control multiplicity at level u. Entries lower than the value defined for the parameter q
will be indicated as discoveries under dsfdr_rejected
effect_size_estimates - For correlation tests (Spearman), will give the spearman correlation coeffcient between rarefied counts and Y. For K-sample, 2-Sample and paired tests, will provide a string describing the ordering of mean ranks of rarefied counts, across levels of Y.
dsfdr_rejected - A vector of taxa indices declared differentially abundant by the DS-FDR method for multiplicity adjustment. This field will not be available if disable_DSFDR
is set to TRUE
.
dsfdr_threshold - The selected threshold, in terms of P-values, for declaring taxa as differentialy abundant. Taxa with P-values under this threshold will be declared diffentially abundant. This field will not be available if disable_DSFDR
is set to TRUE
.
p.values.test.ratio.normalization - A vector with P-values for the different tests of association, by taxa, for the "normalization by ratio" variant of DACOMP. This field will be available only if compute_ratio_normalization
is set to TRUE
.
p.values.test.adjusted.ratio.normalization - A vector of P-values, adjusted for multiplicity, for the tests performed using ratio normalization (given by p.values.test.ratio.normalization
). By default, correction is done by the DS-FDR method. If disable_DSFDR
is set to TRUE
, the BH correction is performed. Entries lower than a value of u indicate a taxon declared differentially abundant when trying to control multiplicity at level u. Entries lower than the value defined for the parameter q
will be indicated as discoveries under dsfdr_rejected_ratio_normalization
effect_size_estimates_ratio - Similar to effect_size_estimates
, but computed over the ratio between taxon counts and total number of reads available under both the reference taxa and the tested taxon.
dsfdr_rejected_ratio_normalization -A vector of taxa indices declared differentially abundant by the DS-FDR method, similar to dsfdr_rejected, but using the P-values obtained for the "normalization by ratio" test. This field will be available only if compute_ratio_normalization
is set to TRUE
and DS-FDR computation is not disabled.
dsfdr_threshold_ratio_normalization - The selected threshold, in terms of P-values, for declaring taxa as differentialy abundant. Taxa with P-values, obtained with "normalization by ratio" type tests, under this threshold will be declared diffentially abundant. This field will be available only if compute_ratio_normalization
is set to TRUE
and DS-FDR computation is not disabled.
rarefied_counts- The rarefied counts for each taxon-wise test, rows are samples, columns are taxa.
Brill, Barak, Amnon Amir, and Ruth Heller. 2019. Testing for Differential Abundance in Compositional Counts Data, with Application to Microbiome Studies. arXiv Preprint arXiv:1904.08937.
Jiang, L, A Amir, JT Morton, R Heller, E Arias-Castro, and R Knight. 2017. Discrete False-Discovery Rate Improves Identification of Differentially Abundant Microbes. mSystems 2: E00092-17. Am Soc Microbiol.
Wagner, Brandie D, Charles E Robertson, and J Kirk Harris. 2011. Application of Two-Part Statistics for Comparison of Sequence Variant Counts. PloS One 6 (5). Public Library of Science: e20296.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | ## Not run:
library(dacomp)
set.seed(1)
# example for a study with two groups:
data = dacomp.generate_example_dataset.two_sample(m1 = 100,
n_X = 50,
n_Y = 50,
signal_strength_as_change_in_microbial_load = 0.1)
#select references: (may take a minute)
result.selected.references = dacomp.select_references(X = data$counts,
minimal_TA = 50, #Choosing the minimal number of reference taxa so that at least 50 reads are available under the reference for all samples
verbose = T)
length(result.selected.references$selected_references)
#plot the reference selection scores (can also be used to better set the median SD threshold)
dacomp.plot_reference_scores(result.selected.references)
#multiplicity correction levels for the BH and DS-FDR methods
q_BH = q_DSFDR = 0.1
#Perform testing:
result.test = dacomp.test(X = data$counts,
y = data$group_labels,
ind_reference_taxa = result.selected.references,
test = DACOMP.TEST.NAME.WILCOXON,nr_perm = 1000,
verbose = T,q = q_DSFDR)
rejected_BH = which(p.adjust(result.test$p.values.test,method = 'BH')<=q_BH)
rejected_DSFDR = result.test$dsfdr_rejected
#We note that our reference set may be contaminated, i.e. include some differentially abundant taxa. Therefore, we run a diagnostic check, trying to remove possible signals from the reference, and retesting:
Cleaned_references = dacomp.validate_references(X = data$counts,
Y = data$group_labels,
ref_obj = result.selected.references,
test =DACOMP.TEST.NAME.WILCOXON,
Q_validation = 0.1,
Minimal_Counts_in_ref_threshold = 10,
Reduction_Factor = 0.9,
Verbose = T,
disable_DSFDR = T,
NR_perm = 10000)
result.test.with.reduced.reference = dacomp.test(X = data$counts,
y = data$group_labels,
ind_reference_taxa = result.selected.references,
test = DACOMP.TEST.NAME.WILCOXON,nr_perm = 1000,
verbose = T,q = q_DSFDR)
rejected_BH = which(p.adjust(result.test.with.reduced.reference$p.values.test,method = 'BH')<=q_BH)
rejected_DSFDR = result.test.with.reduced.reference$dsfdr_rejected
# example with continous outcome
set.seed(1)
data = dacomp.generate_example_dataset_continuous(n = 100,m1 = 30,
signal_strength_as_change_in_microbial_load = 0.2)
result.selected.references = dacomp.select_references(X = data$counts,
minimal_TA = 50, #Choosing the minimal number of reference taxa so that at least 50 reads are available under the reference for all samples
verbose = T)
#number of selected references
length(result.selected.references$selected_references)
#plot the reference selection scores (can also be used to better set the median SD threshold)
dacomp.plot_reference_scores(result.selected.references)
#multiplicity correction levels for the BH and DS-FDR methods
q_BH = q_DSFDR = 0.1
#Perform testing:
result.test = dacomp.test(X = data$counts,
y = data$covariate,test = DACOMP.TEST.NAME.SPEARMAN,
ind_reference_taxa = result.selected.references,nr_perm = 1000,
verbose = T,q = q_DSFDR)
rejected_BH = which(p.adjust(result.test$p.values.test,method = 'BH')<=q_BH)
rejected_DSFDR = result.test$dsfdr_rejected
#We note that our reference set may be contaminated, i.e. include some differentially abundant taxa. Therefore, we run a diagnostic check, trying to remove possible signals from the reference, and retesting:
Cleaned_references = dacomp.validate_references(X = data$counts,
Y = data$covariate,
ref_obj = result.selected.references,
test = DACOMP.TEST.NAME.SPEARMAN,
Q_validation = 0.1,
Minimal_Counts_in_ref_threshold = 10,
Reduction_Factor = 0.9,
Verbose = T,
disable_DSFDR = T,
NR_perm = 1000)
result.test.with.reduced.reference = dacomp.test(X = data$counts,
y = data$covariate,
ind_reference_taxa = Cleaned_references,
test = DACOMP.TEST.NAME.SPEARMAN,nr_perm = 1000,
verbose = T,q = q_DSFDR)
rejected_BH = which(p.adjust(result.test.with.reduced.reference$p.values.test,method = 'BH')<=q_BH)
rejected_DSFDR = result.test.with.reduced.reference$dsfdr_rejected
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.