The D2C (Dependence to Causality) algorithm aims to infer the existence of causal dependencies from static (i.e. non temporal) observational and multivariate data.
The D2C algorithm predicts the existence of a direct causal link between two variables in a multivariate setting by (i) creating a set of of features of the relationship based on asymmetric descriptors of the multivariate dependency and (ii) using a classifier (in this case a Random Forest) to learn a mapping between the features and the presence of a causal link. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two variables causally connected.
This vignette shows how to create a training set for D2C, how to create a D2C object
and how to assess it with a number of simulated datasets.
A comparison with two inference algorithms implemented in the bnearn
package is provided too.
Let us start by creating a set of Directed Acyclic Graphs (DAGs) and the associated simulated datasets.
The object simulatedDAG
can be initialized by setting
the number NDAG
of DAGs, the range noNodes
of the number of nodes, the range N
of observed samples, the type functionType
of dependency and the range sdn
of standard deviation
of the additive noise.
rm(list=ls()) library(D2C) require(RBGL) require(gRbase) noNodes<-c(10,20) ## range of number of nodes N<-c(50,200) ## range of number of samples sd.noise<-c(0.2,1) ## range of values for standard deviation of additive noise NDAG=100 ## number of DAGs to be created and simulated trainDAG<-new("simulatedDAG",NDAG=NDAG, N=N, noNodes=noNodes, functionType = c("linear","quadratic","sigmoid"), seed=0,sdn=sd.noise,quantize=c(TRUE,FALSE),verbose=FALSE)
We can now check the number of DAGs
print(trainDAG@NDAG)
If it happens that the number is smaller than NDAG
,
this is due to the fact that DAGs with no edges are removed.
Let us visualize also the first simulated DAG and the dimension of the related datasets:
print(trainDAG@list.DAGs[[1]]) print(dim(trainDAG@list.observationsDAGs[[1]]))
Once we have a training set of DAGs, we can create a D2C object by storing for (a subset of) the edges of each DAG a vector of descriptors and the related binary label (0 or 1) about the existance of a causal link.
The parameters of the D2C descriptors are contained in a D2C.descriptor
object
descr.example<-new("D2C.descriptor",bivariate=FALSE,ns=3,acc=TRUE,lin=TRUE) trainD2C<-new("D2C",sDAG=trainDAG, descr=descr.example,ratioEdges=1,max.features=30,verbose=FALSE)
Let us print some properties of the trainD2C
object:
print(dim(trainD2C@X)) print(table(trainD2C@Y))
Let us print out the Random forest object stored within trainD2C
print(trainD2C@mod)
Let us now create an independent test set (e.g. by creating
a second simulatedDAG
object with a different seed of the random generator)
and let us store it in the testDAG
object:
NDAG.test=50 noNodes<-c(10,20) ## range of number of nodes N<-c(50,100) testDAG<-new("simulatedDAG",NDAG=NDAG.test, N=N, noNodes=noNodes, functionType = c("linear","quadratic","sigmoid"), quantize=c(TRUE,FALSE), seed=101,sdn=c(0.2,0.5),verbose=FALSE)
In this section we assess the accuracy of D2C and we compare it with two algorithms implemented in the package bnlearn
. Accuracy is measured by the Balanced Error Rate (BER).
For each dataset associated to the simulated DAG contained in testDAG
,
we infer the directionality of the edges and we compare it with the ground truth.
require(foreach) if (!require(bnlearn)){ install.packages("bnlearn", repos="http://cran.rstudio.com/") library(bnlearn) } FF<-foreach (r=1:testDAG@NDAG) %do%{ set.seed(r) observedData<-testDAG@list.observationsDAGs[[r]] trueDAG<-testDAG@list.DAGs[[r]] ## inference of networks with bnlearn package Ahat.GS<-amat(gs(data.frame(observedData),alpha=0.01)) Ahat.IAMB<-(amat(iamb(data.frame(observedData),alpha=0.01))) graphTRUE<- as.adjMAT(trueDAG) ## selection of a balanced subset of edges for the assessment Nodes=nodes(trueDAG) max.edges<-min(30,length(edgeList(trueDAG))) subset.edges = matrix(unlist(sample(edgeList(trueDAG), size = max.edges,replace = F)),ncol=2,byrow = TRUE) subset.edges = rbind(subset.edges,t(replicate(n =max.edges, sample(Nodes,size=2,replace = FALSE)))) Yhat.D2C<-NULL Yhat.IAMB<-NULL Yhat.GS<-NULL Ytrue<-NULL for(jj in 1:NROW(subset.edges)){ i =as(subset.edges[jj,1],"numeric"); j =as(subset.edges[jj,2],"numeric") ; pred.D2C = predict(trainD2C,i,j, observedData) Yhat.D2C<-c(Yhat.D2C,as.numeric(pred.D2C$response) -1) Yhat.IAMB<-c(Yhat.IAMB,Ahat.IAMB[i,j]) Yhat.GS<-c(Yhat.GS,Ahat.GS[i,j]) Ytrue<-c(Ytrue,graphTRUE[subset.edges[jj,1],subset.edges[jj,2]]) } list(Yhat.D2C=Yhat.D2C,Yhat.GS=Yhat.GS, Yhat.IAMB=Yhat.IAMB,Ytrue=Ytrue) } Yhat.D2C<-unlist(lapply(FF,"[[",1) ) Yhat.GS<-unlist(lapply(FF,"[[",2)) Yhat.IAMB<-unlist(lapply(FF,"[[",3)) Ytrue<-unlist(lapply(FF,"[[",4)) ## computation of Balanced Error Rate BER.D2C<-BER(Ytrue,Yhat.D2C) BER.GS<-BER(Ytrue,Yhat.GS)
The average accuracy of the three algorithms (the smaller the BER the better) is shown by typing:
cat("\n BER.D2C=",BER.D2C, "BER.IAMB=",BER.IAMB,"BER.GS=",BER.GS,"\n")
We invite the user to assess the impact of the number of training samples (e.g. by increasing the value NDAG
and retraining D2C) on the accuracy of the network construction.
Let us upload the alarm dataset and let us focus on a highly connected subnetwork of 100 nodes
data(alarm) graphTRUE<-true.net set.seed(0) observedData<-dataset w.const<-which(apply(observedData,2,sd)<0.1) if (length(w.const)>0){ observedData<-observedData[,-w.const] graphTRUE<-graphTRUE[-w.const,-w.const] } indn<-sort(apply(graphTRUE,1,sum)+apply(graphTRUE,2,sum),decr=TRUE,index=TRUE)$ix[1:100] observedData<-observedData[,indn] graphTRUE<-graphTRUE[indn,indn]
Let us first make the inference with two bnlearn
algorithms,
n<-NCOL(observedData) Ahat.GS<-amat(gs(data.frame(observedData))) Ahat.IAMB<-amat(iamb(data.frame(observedData),alpha=0.05))
and then with D2C...
Yhat.D2C<-NULL Yhat.GS<-NULL Yhat.IAMB<-NULL Ytrue<-NULL for (i in 1:n){ ## creation of a balanced test set ind1<-which(graphTRUE[i,]==1) ind0<-setdiff(setdiff(1:n,i),ind1) ind<-c(ind1,ind0[1:length(ind1)]) FF<-foreach(j=ind) %do%{ list(Yhat=as.numeric(predict(trainD2C,i,j,observedData)$response) -1, Yhat2=Ahat.GS[i,j],Yhat3=Ahat.IAMB[i,j],Ytrue=graphTRUE[i,j]) } Yhat.D2C<-c(Yhat.D2C,unlist(lapply(FF,"[[",1) )) Yhat.GS<-c(Yhat.GS,unlist(lapply(FF,"[[",2))) Yhat.IAMB<-c(Yhat.IAMB,unlist(lapply(FF,"[[",3))) Ytrue<-c(Ytrue,unlist(lapply(FF,"[[",4))) }
You can now print out the balanced error rates (BER) of the three techniques by using:
cat("\n BER.D2C=",BER(Ytrue,Yhat.D2C), "BER.GS=",BER(Ytrue,Yhat.GS), "BER.IAMB=",BER(Ytrue,Yhat.IAMB), "\n")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.