D2C Vignette

The D2C algorithm

The D2C (Dependence to Causality) algorithm aims to infer the existence of causal dependencies from static (i.e. non temporal) observational and multivariate data.

The D2C algorithm predicts the existence of a direct causal link between two variables in a multivariate setting by (i) creating a set of of features of the relationship based on asymmetric descriptors of the multivariate dependency and (ii) using a classifier (in this case a Random Forest) to learn a mapping between the features and the presence of a causal link. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two variables causally connected.

This vignette shows how to create a training set for D2C, how to create a D2C object and how to assess it with a number of simulated datasets. A comparison with two inference algorithms implemented in the bnearn package is provided too.

Creation of training data

Let us start by creating a set of Directed Acyclic Graphs (DAGs) and the associated simulated datasets.

The object simulatedDAG can be initialized by setting the number NDAG of DAGs, the range noNodes of the number of nodes, the range N of observed samples, the type functionType of dependency and the range sdn of standard deviation of the additive noise.

rm(list=ls())
library(D2C)
require(RBGL)
require(gRbase)



noNodes<-c(10,20)
## range of number of nodes

N<-c(50,200)
## range of number of samples

sd.noise<-c(0.2,1)
## range of values for standard deviation of additive noise 

NDAG=100
## number of DAGs to be created and simulated


trainDAG<-new("simulatedDAG",NDAG=NDAG, N=N, noNodes=noNodes,
              functionType = c("linear","quadratic","sigmoid"), 
              seed=0,sdn=sd.noise,quantize=c(TRUE,FALSE),verbose=FALSE)

We can now check the number of DAGs

print(trainDAG@NDAG)

If it happens that the number is smaller than NDAG, this is due to the fact that DAGs with no edges are removed.

Let us visualize also the first simulated DAG and the dimension of the related datasets:

print(trainDAG@list.DAGs[[1]])
print(dim(trainDAG@list.observationsDAGs[[1]]))

Learning of a D2C object

Once we have a training set of DAGs, we can create a D2C object by storing for (a subset of) the edges of each DAG a vector of descriptors and the related binary label (0 or 1) about the existance of a causal link.

The parameters of the D2C descriptors are contained in a D2C.descriptor object

descr.example<-new("D2C.descriptor",bivariate=FALSE,ns=3,acc=TRUE,lin=TRUE)

trainD2C<-new("D2C",sDAG=trainDAG, 
              descr=descr.example,ratioEdges=1,max.features=30,verbose=FALSE)

Let us print some properties of the trainD2C object:

print(dim(trainD2C@X))
print(table(trainD2C@Y))

Let us print out the Random forest object stored within trainD2C

print(trainD2C@mod)

Creation of test data

Let us now create an independent test set (e.g. by creating
a second simulatedDAG object with a different seed of the random generator) and let us store it in the testDAG object:

NDAG.test=50
noNodes<-c(10,20)
## range of number of nodes

N<-c(50,100)

testDAG<-new("simulatedDAG",NDAG=NDAG.test, N=N, noNodes=noNodes,
             functionType = c("linear","quadratic","sigmoid"), quantize=c(TRUE,FALSE),
             seed=101,sdn=c(0.2,0.5),verbose=FALSE)

Assessment of accuracy

In this section we assess the accuracy of D2C and we compare it with two algorithms implemented in the package bnlearn. Accuracy is measured by the Balanced Error Rate (BER). For each dataset associated to the simulated DAG contained in testDAG, we infer the directionality of the edges and we compare it with the ground truth.

require(foreach)
if (!require(bnlearn)){
  install.packages("bnlearn", repos="http://cran.rstudio.com/")
  library(bnlearn)
  }
FF<-foreach (r=1:testDAG@NDAG) %do%{
  set.seed(r)
  observedData<-testDAG@list.observationsDAGs[[r]]
  trueDAG<-testDAG@list.DAGs[[r]]

  ## inference of networks with bnlearn package  
  Ahat.GS<-amat(gs(data.frame(observedData),alpha=0.01))
  Ahat.IAMB<-(amat(iamb(data.frame(observedData),alpha=0.01)))

  graphTRUE<- as.adjMAT(trueDAG) 


  ## selection of a balanced subset of edges for the assessment
  Nodes=nodes(trueDAG)
  max.edges<-min(30,length(edgeList(trueDAG)))
  subset.edges = matrix(unlist(sample(edgeList(trueDAG),
                                      size = max.edges,replace = F)),ncol=2,byrow = TRUE)  
  subset.edges = rbind(subset.edges,t(replicate(n =max.edges,
                                                sample(Nodes,size=2,replace = FALSE))))

  Yhat.D2C<-NULL
  Yhat.IAMB<-NULL
  Yhat.GS<-NULL
  Ytrue<-NULL
  for(jj in 1:NROW(subset.edges)){
    i =as(subset.edges[jj,1],"numeric"); 
    j =as(subset.edges[jj,2],"numeric") ;
    pred.D2C = predict(trainD2C,i,j, observedData)

    Yhat.D2C<-c(Yhat.D2C,as.numeric(pred.D2C$response)  -1)
    Yhat.IAMB<-c(Yhat.IAMB,Ahat.IAMB[i,j])
    Yhat.GS<-c(Yhat.GS,Ahat.GS[i,j])
    Ytrue<-c(Ytrue,graphTRUE[subset.edges[jj,1],subset.edges[jj,2]])
    }

  list(Yhat.D2C=Yhat.D2C,Yhat.GS=Yhat.GS,
       Yhat.IAMB=Yhat.IAMB,Ytrue=Ytrue)
  }

Yhat.D2C<-unlist(lapply(FF,"[[",1) )
Yhat.GS<-unlist(lapply(FF,"[[",2))
Yhat.IAMB<-unlist(lapply(FF,"[[",3))
Ytrue<-unlist(lapply(FF,"[[",4))
## computation of Balanced Error Rate
BER.D2C<-BER(Ytrue,Yhat.D2C)
BER.GS<-BER(Ytrue,Yhat.GS)

The average accuracy of the three algorithms (the smaller the BER the better) is shown by typing:

cat("\n BER.D2C=",BER.D2C, "BER.IAMB=",BER.IAMB,"BER.GS=",BER.GS,"\n")

We invite the user to assess the impact of the number of training samples (e.g. by increasing the value NDAG and retraining D2C) on the accuracy of the network construction.

Assessment of accuracy with the alarm dataset

Let us upload the alarm dataset and let us focus on a highly connected subnetwork of 100 nodes

data(alarm)

graphTRUE<-true.net
set.seed(0)

observedData<-dataset

w.const<-which(apply(observedData,2,sd)<0.1)
if (length(w.const)>0){
  observedData<-observedData[,-w.const]
  graphTRUE<-graphTRUE[-w.const,-w.const]
  }
indn<-sort(apply(graphTRUE,1,sum)+apply(graphTRUE,2,sum),decr=TRUE,index=TRUE)$ix[1:100]
observedData<-observedData[,indn]
graphTRUE<-graphTRUE[indn,indn]

Let us first make the inference with two bnlearn algorithms,

n<-NCOL(observedData)


Ahat.GS<-amat(gs(data.frame(observedData)))
Ahat.IAMB<-amat(iamb(data.frame(observedData),alpha=0.05))

and then with D2C...

Yhat.D2C<-NULL
Yhat.GS<-NULL
Yhat.IAMB<-NULL
Ytrue<-NULL


for (i in 1:n){     
  ## creation of a balanced test set  
  ind1<-which(graphTRUE[i,]==1)
  ind0<-setdiff(setdiff(1:n,i),ind1)
  ind<-c(ind1,ind0[1:length(ind1)])

  FF<-foreach(j=ind) %do%{
     list(Yhat=as.numeric(predict(trainD2C,i,j,observedData)$response)  -1,
         Yhat2=Ahat.GS[i,j],Yhat3=Ahat.IAMB[i,j],Ytrue=graphTRUE[i,j])   
    }
  Yhat.D2C<-c(Yhat.D2C,unlist(lapply(FF,"[[",1) ))
  Yhat.GS<-c(Yhat.GS,unlist(lapply(FF,"[[",2)))
  Yhat.IAMB<-c(Yhat.IAMB,unlist(lapply(FF,"[[",3)))
  Ytrue<-c(Ytrue,unlist(lapply(FF,"[[",4)))
  }

You can now print out the balanced error rates (BER) of the three techniques by using:

cat("\n BER.D2C=",BER(Ytrue,Yhat.D2C), "BER.GS=",BER(Ytrue,Yhat.GS),
    "BER.IAMB=",BER(Ytrue,Yhat.IAMB),
    "\n") 


Try the D2C package in your browser

Any scripts or data that you put into this service are public.

D2C documentation built on May 29, 2017, 10:44 a.m.