This repository contains the R package ORdensity, which implements the statistical method presented in the paper Identification of differentially expressed genes by means of outlier detection by Irigoien and Arenas, BMC Bioinformatics 2018 ([1]).
A key problem in microarray data analysis is selecting, from thousands of genes, a small number of informative differentially expressed (DE) genes that may be key elements for a disease. If each gene is analyzed individually, a large number of hypotheses must be tested and a multiple comparison correction method must be used. Consequently, the resulting cut-off value may be too small. Moreover, the replicability of the selected DE genes is an important issue. The package ORdensity is designed to obtain a reproducible selection of DE genes by the method presented in [1], which is not a gene-by-gene approach. The core function 'findDEgenes' provides three measures related to the concepts of outlier and density of false positives in a neighbourhood, which allow the DE genes to be identified with high classification accuracy. The first measure is an index called OR, previously introduced in [2, 3]; the other two measures, called FP and dFP, were introduced in [1]. Additional functions provided in this package, such as 'preclusteredData' and 'plotFPvsOR', facilitate exploring and understanding the results. Because working with large datasets requires long execution times and considerable computational effort, parallelization strategies were used to shorten the analysis.
[1] Irigoien I, Arenas C. Identification of differentially expressed genes by means of outlier detection. BMC Bioinformatics 2018;19(1):317:1–317:20.
[2] Arenas C, Toma C, Cormand B, Irigoien I. Identifying extreme observations, outliers and noise in clinical and genetic data. Current Bioinformatics 2017;12(2):101–17.
[3] Arenas C, Irigoien I, Mestres F, Toma C, Cormand B. Extreme observations in biomedical data. In: Ainsbury EA, Calle ML, Cardis E, et al., editors. Extended Abstracts Fall 2015. Trends in Mathematics vol 7. Birkhäuser, Cham: Springer; 2017. p. 3–8.
To install the package from this repository, run the following code:
library('devtools')
install_github('jmartinezot/ORdensity')
This package requires the cluster library to be installed; otherwise it will be installed and loaded automatically. Likewise, the foreach library is used for parallelization.
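Before loading ORdensity it can be useful to confirm that both dependencies are present. The following is a minimal sketch using only base R's requireNamespace; the variable names are illustrative and are not part of the package.

```r
# Check whether the two dependencies named above are available.
needed <- c("cluster", "foreach")
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report anything that is missing instead of failing later on.
missing_pkgs <- needed[!available]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```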
To start working with the package, load it in the R environment with the following command:
library('ORdensity')
An example data frame called simexpr is shipped with the package. It is the result of a simulation of 100 differentially expressed genes in a pool of 1000 genes, and contains 1000 observations of 62 variables. Each row corresponds to a gene and contains 62 values: DEgen, gap, and the gene expression values in 30 positive cases and in 30 negative cases. The DEgen field is 1 for differentially expressed genes and 0 for those which are not.
First, let us extract the samples of each experimental condition from the example dataset.
x <- simexpr[, 3:32]
y <- simexpr[, 33:62]
EXC.1 <- as.matrix(x)
EXC.2 <- as.matrix(y)
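Before building the object it may help to confirm that the column ranges split the data as intended. The sketch below uses a stand-in data frame with the same shape as simexpr (1000 rows, 62 columns). Apart from DEgen and gap, the column names, the random values and the placeholder value of gap are assumptions made only so the example runs without the package.

```r
# Stand-in for simexpr: 2 annotation columns + 60 expression columns.
set.seed(1)
n_genes <- 1000
mock <- data.frame(
  DEgen = rep(c(1, 0), times = c(100, 900)),  # 100 simulated DE genes
  gap   = 0,                                  # placeholder value only
  matrix(rnorm(n_genes * 60), nrow = n_genes) # hypothetical expression values
)

# Same column ranges as in the README: 3:32 and 33:62.
cond_1 <- as.matrix(mock[, 3:32])
cond_2 <- as.matrix(mock[, 33:62])

# Both conditions must describe the same genes, with 30 samples each.
stopifnot(
  nrow(cond_1) == nrow(cond_2),
  ncol(cond_1) == 30, ncol(cond_2) == 30,
  is.numeric(cond_1), is.numeric(cond_2)
)
```

The same stopifnot call can be applied to EXC.1 and EXC.2 once they have been created from the real data.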
To create an S4 object to perform the analysis, run the following command:
myORdensity <- new("ORdensity", Exp_cond_1 = EXC.1, Exp_cond_2 = EXC.2)
By default, parallelization is disabled. To enable it, run instead
myORdensity <- new("ORdensity", Exp_cond_1 = EXC.1, Exp_cond_2 = EXC.2, parallel = TRUE)
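Whether parallel = TRUE pays off depends on how many cores the machine offers. As a small sketch, the base parallel package can report this; how ORdensity distributes the work internally is not described here, so the variable use_parallel is only an illustrative flag.

```r
# detectCores() may return NA on some platforms, so guard against that.
n_cores <- parallel::detectCores()
use_parallel <- !is.na(n_cores) && n_cores > 1

message("Cores detected: ", n_cores, "; parallel run suggested: ", use_parallel)
```

The flag could then be passed as parallel = use_parallel when creating the object.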
It is also possible to enable or disable replicability, and to pass the seed to the pseudorandom number generator. The default values are
myORdensity <- new("ORdensity", Exp_cond_1 = EXC.1, Exp_cond_2 = EXC.2, replicable = TRUE, seed = 0)
Here the given seed is used to initialize the pseudorandom number generator. If replicable = FALSE, no seed is used.
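The effect of a fixed seed can be illustrated with base R alone. The helper draw_resample below is hypothetical and is not part of ORdensity; it only mimics the replicable and seed arguments to show why two runs with the same seed give identical random draws.

```r
# Hypothetical helper: resample indices, optionally with a fixed seed.
draw_resample <- function(n, replicable = TRUE, seed = 0) {
  if (replicable) set.seed(seed)  # same seed, same sequence of draws
  sample.int(n, size = n, replace = TRUE)
}

first_run  <- draw_resample(10)
second_run <- draw_resample(10)

# With replicable = TRUE both runs return exactly the same indices.
same_draws <- identical(first_run, second_run)
```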
If the researcher just wants to extract the differentially expressed genes detected by the ORdensity method, a call to findDEgenes will return a list with the clusters found, along with their mean value of the OR statistic. Higher OR values indicate a higher probability that the genes are truly differentially expressed.
For example, after running this code
result <- findDEgenes(myORdensity)
the method reports that the optimal clustering consists of just two clusters:
The ORdensity method has found that the optimal clustering of the data consists of 2 clusters
We can then look at the number of genes in each cluster and the mean value of the OR index in each cluster:
> result
[[1]]
[[1]]$cluster_number
[1] 1
[[1]]$numberOfGenes
[1] 84
[[1]]$meanOR
[1] 62.99879
[[2]]
[[2]]$cluster_number
[1] 2
[[2]]$numberOfGenes
[1] 23
[[2]]$meanOR
[1] 10.96129
The clusters are sorted in decreasing order of the mean of the OR statistic. We see that the mean is higher in the first cluster (62.99879) than in the second one (10.96129), which means that the first cluster is more likely to be composed of true differentially expressed genes, and the second one of false positives. With more clusters, the last ones are likely to be false positives.
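A nested list like the one printed above is easier to compare as a table. The following sketch rebuilds the printed output by hand (the values are copied from it, and the name result_example is illustrative) and flattens it with base R.

```r
# Hand-built copy of the printed findDEgenes output shown above.
result_example <- list(
  list(cluster_number = 1, numberOfGenes = 84, meanOR = 62.99879),
  list(cluster_number = 2, numberOfGenes = 23, meanOR = 10.96129)
)

# One row per cluster; each inner list becomes a one-row data frame.
cluster_table <- do.call(rbind, lapply(result_example, as.data.frame))
print(cluster_table)

# Total number of candidate genes across all clusters.
total_genes <- sum(cluster_table$numberOfGenes)
```

Assuming the real return value has the same structure as the printed output, the same two lines would apply to it directly.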
We could also check a more detailed summary of the object and obtain the genes identified as DE genes. Following [1], two types of differentially expressed gene selection can be made: a strong selection (FP = 0) and a flexible selection (FP lower than the expected number of false positive neighbours).
The motivation of the clustering is to distinguish those false positives that score high in OR and low in FP and dFP, but are similar to other known false positives obtained by bootstrapping. The procedure is detailed in [1] and uses the PAM clustering procedure.
summary(myORdensity)
The output would be the following
This is the proposed clustering made by the ORdensity method
For the computation of FP and dFP a total of 10 neighbours have been taken into account
The expected number of false positives neighbours is 8.237232
The ORdensity method has found that the optimal clustering of the data consists of 2 clusters
$neighbours
[1] 10
$expectedFalsePositiveNeighbours
[1] 8.237232
$clusters
$clusters[[1]]
id OR FP dFP
[1,] 62 175.04322 0.0 0.0000000
[2,] 50 172.28779 0.0 0.0000000
[3,] 61 155.53626 0.0 0.0000000
[4,] 70 152.54705 0.0 0.0000000
[5,] 2 149.65335 0.0 0.0000000
[6,] 68 148.87294 0.0 0.0000000
[7,] 32 144.80790 0.0 0.0000000
[8,] 7 143.42201 0.0 0.0000000
[9,] 52 134.87709 0.0 0.0000000
[10,] 10 130.61843 0.0 0.0000000
[11,] 36 120.82917 0.0 0.0000000
[12,] 40 104.99403 0.0 0.0000000
[13,] 93 103.66296 0.0 0.0000000
[14,] 65 98.36178 0.0 0.0000000
[15,] 82 93.45266 0.0 0.0000000
[16,] 15 93.16310 0.0 0.0000000
[17,] 64 92.59491 0.0 0.0000000
[18,] 24 92.20146 0.0 0.0000000
[19,] 73 91.67733 0.0 0.0000000
[20,] 67 88.44531 0.0 0.0000000
[21,] 29 88.43950 0.0 0.0000000
[22,] 6 84.47275 0.0 0.0000000
[23,] 17 84.10083 0.0 0.0000000
[24,] 34 77.32266 0.0 0.0000000
[25,] 23 76.09034 0.0 0.0000000
[26,] 71 74.37630 0.0 0.0000000
[27,] 5 73.18440 0.0 0.0000000
[28,] 53 69.74061 0.0 0.0000000
[29,] 11 68.71638 0.0 0.0000000
[30,] 89 64.38506 0.0 0.0000000
[31,] 37 64.09250 0.0 0.0000000
[32,] 28 62.75227 0.0 0.0000000
[33,] 26 58.38549 0.0 0.0000000
[34,] 81 54.30241 0.0 0.0000000
[35,] 91 52.91573 0.0 0.0000000
[36,] 25 51.70272 0.0 0.0000000
[37,] 21 50.88095 0.0 0.0000000
[38,] 99 48.72145 0.0 0.0000000
[39,] 97 48.44279 0.0 0.0000000
[40,] 3 46.00194 0.0 0.0000000
[41,] 92 45.90385 0.0 0.0000000
[42,] 58 45.50605 0.0 0.0000000
[43,] 42 44.77756 0.0 0.0000000
[44,] 87 44.27434 0.0 0.0000000
[45,] 39 44.21818 0.0 0.0000000
[46,] 16 43.84997 0.0 0.0000000
[47,] 90 42.72040 0.0 0.0000000
[48,] 9 42.01739 0.0 0.0000000
[49,] 48 41.82011 0.0 0.0000000
[50,] 22 40.45328 0.0 0.0000000
[51,] 1 40.13427 0.0 0.0000000
[52,] 83 39.63938 0.0 0.0000000
[53,] 76 38.54992 0.0 0.0000000
[54,] 14 38.54946 0.0 0.0000000
[55,] 8 35.75173 0.0 0.0000000
[56,] 80 34.41224 0.0 0.0000000
[57,] 13 33.95962 0.0 0.0000000
[58,] 63 65.35801 0.1 0.2393753
[59,] 54 59.93344 0.1 0.2820190
[60,] 47 38.60668 0.1 0.2667662
[61,] 79 35.04591 0.1 0.3444442
[62,] 86 31.39571 0.1 0.2931306
[63,] 72 54.06801 0.2 0.4522637
[64,] 4 38.79233 0.2 0.7073464
[65,] 35 34.61744 0.2 0.6733581
[66,] 88 35.14700 0.3 1.4112193
[67,] 100 30.78572 0.3 1.1781804
[68,] 56 34.25862 0.4 1.3966969
[69,] 84 31.73809 0.6 1.9543445
[70,] 20 29.47171 1.1 3.2552493
[71,] 66 24.24462 1.4 4.0342057
[72,] 95 19.42965 1.6 5.6854818
[73,] 60 23.04710 1.8 6.0885440
[74,] 12 20.62520 2.2 6.9215094
[75,] 59 18.74276 2.5 10.1692216
[76,] 27 24.79341 2.6 6.9045174
[77,] 69 22.76688 2.7 8.3615886
[78,] 98 18.32814 2.8 11.2156424
[79,] 49 23.11792 3.0 9.1467849
[80,] 55 26.19269 3.3 10.7893489
[81,] 30 18.37483 3.5 7.9978672
[82,] 57 18.91166 3.6 12.7433027
[83,] 96 16.85913 3.8 14.9634715
[84,] 46 14.63207 3.8 13.7055582
$clusters[[2]]
id OR FP dFP
[1,] 74 17.892081 4.9 14.84705
[2,] 31 17.597946 5.1 20.46774
[3,] 45 14.734285 5.2 22.29119
[4,] 85 13.488933 5.2 21.39663
[5,] 33 11.898146 5.9 22.35787
[6,] 78 10.339745 6.0 38.41902
[7,] 43 14.577235 6.3 28.06259
[8,] 19 11.400038 6.6 31.42799
[9,] 18 9.955954 6.6 38.49118
[10,] 75 14.767943 6.9 33.21473
[11,] 51 11.009624 7.4 45.27537
[12,] 94 8.452682 8.0 60.76025
[13,] 44 9.085678 8.3 57.26337
[14,] 399 7.992158 8.3 68.18916
[15,] 104 9.543673 8.5 58.97102
[16,] 38 8.201157 8.9 76.21379
[17,] 41 14.155428 9.5 33.61409
[18,] 598 7.385797 9.8 45.59191
[19,] 277 8.133859 10.0 37.81764
[20,] 618 8.115440 10.0 85.55168
[21,] 946 8.108665 10.0 58.13935
[22,] 651 7.659571 10.0 56.17629
[23,] 670 7.613712 10.0 84.82799
As a rule of thumb, differentially expressed genes are expected to present high values of OR and low values of FP and dFP. Each gene can also be analyzed individually inside each cluster.
If the researcher is interested in a more thorough analysis, other functions are available.
The data prior to clustering can be obtained with the following function:
preclusteredData(myORdensity)
Columns "Strong" and "Flexible" show the genes identified as DE genes
They denote the strong selection (FP=0) with S and the flexible selection (FP < expectedFalsePositives) with F
id OR FP dFP Strong Flexible
62 Gene62 175.043223 0.0 0.0000000 S F
50 Gene50 172.287790 0.0 0.0000000 S F
61 Gene61 155.536259 0.0 0.0000000 S F
70 Gene70 152.547051 0.0 0.0000000 S F
2 Gene2 149.653354 0.0 0.0000000 S F
68 Gene68 148.872937 0.0 0.0000000 S F
32 Gene32 144.807897 0.0 0.0000000 S F
7 Gene7 143.422005 0.0 0.0000000 S F
52 Gene52 134.877088 0.0 0.0000000 S F
10 Gene10 130.618427 0.0 0.0000000 S F
36 Gene36 120.829166 0.0 0.0000000 S F
40 Gene40 104.994030 0.0 0.0000000 S F
92 Gene93 103.662961 0.0 0.0000000 S F
65 Gene65 98.361779 0.0 0.0000000 S F
81 Gene82 93.452659 0.0 0.0000000 S F
15 Gene15 93.163103 0.0 0.0000000 S F
64 Gene64 92.594912 0.0 0.0000000 S F
24 Gene24 92.201463 0.0 0.0000000 S F
73 Gene73 91.677326 0.0 0.0000000 S F
67 Gene67 88.445307 0.0 0.0000000 S F
29 Gene29 88.439498 0.0 0.0000000 S F
6 Gene6 84.472748 0.0 0.0000000 S F
17 Gene17 84.100832 0.0 0.0000000 S F
34 Gene34 77.322660 0.0 0.0000000 S F
23 Gene23 76.090339 0.0 0.0000000 S F
71 Gene71 74.376304 0.0 0.0000000 S F
5 Gene5 73.184401 0.0 0.0000000 S F
53 Gene53 69.740607 0.0 0.0000000 S F
11 Gene11 68.716381 0.0 0.0000000 S F
88 Gene89 64.385061 0.0 0.0000000 S F
37 Gene37 64.092504 0.0 0.0000000 S F
28 Gene28 62.752272 0.0 0.0000000 S F
26 Gene26 58.385493 0.0 0.0000000 S F
80 Gene81 54.302410 0.0 0.0000000 S F
90 Gene91 52.915735 0.0 0.0000000 S F
25 Gene25 51.702722 0.0 0.0000000 S F
21 Gene21 50.880954 0.0 0.0000000 S F
98 Gene99 48.721447 0.0 0.0000000 S F
96 Gene97 48.442793 0.0 0.0000000 S F
3 Gene3 46.001939 0.0 0.0000000 S F
91 Gene92 45.903855 0.0 0.0000000 S F
58 Gene58 45.506047 0.0 0.0000000 S F
42 Gene42 44.777559 0.0 0.0000000 S F
86 Gene87 44.274336 0.0 0.0000000 S F
39 Gene39 44.218182 0.0 0.0000000 S F
16 Gene16 43.849973 0.0 0.0000000 S F
89 Gene90 42.720397 0.0 0.0000000 S F
9 Gene9 42.017388 0.0 0.0000000 S F
48 Gene48 41.820112 0.0 0.0000000 S F
22 Gene22 40.453281 0.0 0.0000000 S F
1 Gene1 40.134270 0.0 0.0000000 S F
82 Gene83 39.639383 0.0 0.0000000 S F
76 Gene76 38.549919 0.0 0.0000000 S F
14 Gene14 38.549461 0.0 0.0000000 S F
8 Gene8 35.751730 0.0 0.0000000 S F
79 Gene80 34.412237 0.0 0.0000000 S F
13 Gene13 33.959616 0.0 0.0000000 S F
63 Gene63 65.358008 0.1 0.2393753 F
54 Gene54 59.933443 0.1 0.2820190 F
47 Gene47 38.606681 0.1 0.2667662 F
78 Gene79 35.045910 0.1 0.3444442 F
85 Gene86 31.395715 0.1 0.2931306 F
72 Gene72 54.068005 0.2 0.4522637 F
4 Gene4 38.792334 0.2 0.7073464 F
35 Gene35 34.617445 0.2 0.6733581 F
87 Gene88 35.147004 0.3 1.4112193 F
99 Gene100 30.785716 0.3 1.1781804 F
56 Gene56 34.258620 0.4 1.3966969 F
83 Gene84 31.738086 0.6 1.9543445 F
20 Gene20 29.471714 1.1 3.2552493 F
66 Gene66 24.244624 1.4 4.0342057 F
94 Gene95 19.429654 1.6 5.6854818 F
60 Gene60 23.047096 1.8 6.0885440 F
12 Gene12 20.625201 2.2 6.9215094 F
59 Gene59 18.742760 2.5 10.1692216 F
27 Gene27 24.793410 2.6 6.9045174 F
69 Gene69 22.766881 2.7 8.3615886 F
97 Gene98 18.328136 2.8 11.2156424 F
49 Gene49 23.117921 3.0 9.1467849 F
55 Gene55 26.192692 3.3 10.7893489 F
30 Gene30 18.374832 3.5 7.9978672 F
57 Gene57 18.911661 3.6 12.7433027 F
95 Gene96 16.859134 3.8 14.9634715 F
46 Gene46 14.632071 3.8 13.7055582 F
74 Gene74 17.892081 4.9 14.8470517 F
31 Gene31 17.597946 5.1 20.4677432 F
45 Gene45 14.734285 5.2 22.2911883 F
84 Gene85 13.488933 5.2 21.3966265 F
33 Gene33 11.898146 5.9 22.3578662 F
77 Gene78 10.339745 6.0 38.4190204 F
43 Gene43 14.577235 6.3 28.0625912 F
19 Gene19 11.400038 6.6 31.4279856 F
18 Gene18 9.955954 6.6 38.4911843 F
75 Gene75 14.767943 6.9 33.2147288 F
51 Gene51 11.009624 7.4 45.2753660 F
93 Gene94 8.452682 8.0 60.7602533 F
44 Gene44 9.085678 8.3 57.2633727
102 Gene399 7.992158 8.3 68.1891610
100 Gene104 9.543673 8.5 58.9710191
38 Gene38 8.201157 8.9 76.2137941
41 Gene41 14.155428 9.5 33.6140872
103 Gene598 7.385797 9.8 45.5919137
101 Gene277 8.133859 10.0 37.8176444
104 Gene618 8.115440 10.0 85.5516822
107 Gene946 8.108665 10.0 58.1393519
105 Gene651 7.659571 10.0 56.1762919
106 Gene670 7.613712 10.0 84.8279861
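The Strong and Flexible columns follow directly from the two rules stated in the output header. As a sketch, the rules can be reproduced on a few rows copied from the table above, with the expected number of false positive neighbours (8.237232) taken from the summary output; the data frame candidates is built by hand for illustration.

```r
# Expected number of false positive neighbours, as reported by summary().
expected_fp <- 8.237232

# A few rows copied from the preclusteredData output above.
candidates <- data.frame(
  id = c("Gene62", "Gene63", "Gene46", "Gene44", "Gene277"),
  OR = c(175.043223, 65.358008, 14.632071, 9.085678, 8.133859),
  FP = c(0.0, 0.1, 3.8, 8.3, 10.0)
)

# Strong selection: FP = 0. Flexible selection: FP below the expected value.
candidates$Strong   <- ifelse(candidates$FP == 0, "S", "")
candidates$Flexible <- ifelse(candidates$FP < expected_fp, "F", "")
print(candidates)
```

Gene44 (FP = 8.3) falls just above the threshold, which matches the empty Flexible entry in the table.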
A plot representing the potential genes by OR (vertical axis), FP (horizontal axis) and dFP (the size of each circle is inversely proportional to its value) can also be obtained. The plot is similar to Fig. 3b in [1].
plotFPvsOR(myORdensity)
By default, the number of clusters computed by the ORdensity method is used. Other values for the number of clusters can be specified.
plotFPvsOR(myORdensity, k = 5)
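To make the encoding of the plot concrete, here is a rough base R sketch of the same idea on a few rows copied from the summary output. It is not the code behind plotFPvsOR: the function names are hypothetical, and the size rule max_cex / (1 + dFP) is an assumption chosen so that dFP = 0 does not produce an infinite size.

```r
# A few candidate genes copied from the summary output above.
candidates <- data.frame(
  id  = c(62, 63, 46, 74, 670),
  OR  = c(175.04322, 65.35801, 14.63207, 17.892081, 7.613712),
  FP  = c(0.0, 0.1, 3.8, 4.9, 10.0),
  dFP = c(0, 0.2393753, 13.7055582, 14.84705, 84.82799)
)

# Assumed size rule: larger dFP gives a smaller circle.
circle_size <- function(dFP, max_cex = 3) max_cex / (1 + dFP)

# Hypothetical plotting helper: FP on the x axis, OR on the y axis.
sketch_fp_vs_or <- function(df) {
  plot(df$FP, df$OR, cex = circle_size(df$dFP),
       xlab = "FP", ylab = "OR", main = "Sketch of FP vs OR")
  text(df$FP, df$OR, labels = df$id, pos = 4)
  invisible(df)
}
```

Calling sketch_fp_vs_or(candidates) draws the figure: genes in the top-left corner with large circles are the strongest candidates.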
It is also possible to see a graphic representation of the clustering projected onto the first two principal components
clusplotk(myORdensity)
The plot of k values against the silhouette measure is also provided.
silhouetteAnalysis(myORdensity)
Other numbers of clusters can also be checked
clusplotk(myORdensity, k = 4)
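The silhouette measure mentioned above can be explored on toy data with the cluster package, which provides the PAM procedure. The sketch below is an illustration on simulated points, not the package's internal code; the toy data and the range of k values are assumptions.

```r
library(cluster)

# Toy data: two tight, well-separated groups of 20 points each.
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# Average silhouette width of the PAM clustering for each candidate k.
k_values <- 2:5
avg_sil <- vapply(k_values,
                  function(k) pam(toy, k)$silinfo$avg.width,
                  numeric(1))

# The k with the highest average silhouette width.
best_k <- k_values[which.max(avg_sil)]
toy_clustering <- pam(toy, best_k)$clustering
```

Plotting avg_sil against k_values gives a curve of the same kind as the one produced by silhouetteAnalysis.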