The D2C algorithm

The D2C (Dependence to Causality) algorithm aims to infer the existence of causal dependencies from static (i.e. non-temporal) multivariate observational data.

The D2C algorithm predicts the existence of a direct causal link between two variables in a multivariate setting by (i) creating a set of features of the relationship based on asymmetric descriptors of the multivariate dependency and (ii) using a classifier (in this case a Random Forest) to learn a mapping between the features and the presence of a causal link. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two causally connected variables.
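Schematically, the prediction for an ordered pair of variables (i, j) can be thought of as the following two-step sketch, where compute.descriptors and trained.classifier are hypothetical placeholders and not functions of the package:

## purely illustrative sketch of the two-step D2C prediction for the ordered pair (i, j)
d2c.sketch<-function(trained.classifier,compute.descriptors,data,i,j){
  x<-compute.descriptors(data,i,j)  ## (i) asymmetric descriptors of the dependency of i and j
  predict(trained.classifier,x)     ## (ii) predicted presence (1) or absence (0) of the link i -> j
}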

This vignette shows how to create a training set for D2C, how to create a D2C object and how to assess it on a number of simulated datasets. A comparison with two inference algorithms implemented in the bnlearn package is also provided.

Creation of training data

Let us start by creating a set of Directed Acyclic Graphs (DAGs) and the associated simulated datasets.

The object simulatedDAG can be initialized by setting the number NDAG of DAGs, the range noNodes of the number of nodes, the range N of observed samples, the type functionType of dependency and the range sdn of standard deviation of the additive noise.

rm(list=ls())
library(D2C)
require(RBGL)
require(gRbase)
require(igraph)
require(graph)

set.seed(0)

noNodes<-c(50,100)
## range of number of nodes

N<-c(100,200)
## range of number of samples

sd.noise<-c(0.2,0.5)
## range of values for standard deviation of additive noise 

NDAG=20
## number of DAGs to be created and simulated


trainDAG<-new("simulatedDAG",NDAG=NDAG, N=N, noNodes=noNodes,
              functionType = c("linear","quadratic","sigmoid"), 
              seed=0,sdn=sd.noise,quantize=c(TRUE,FALSE),verbose=TRUE)

We can now check the number of DAGs:

print(trainDAG@NDAG)

If this number is smaller than NDAG, it is because DAGs with no edges are removed.
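A quick check of this, using only the objects already defined above:

## number of requested DAGs discarded because they contained no edges
if (trainDAG@NDAG < NDAG)
  cat(NDAG - trainDAG@NDAG, "empty DAGs were discarded\n")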

Let us also visualize the first simulated DAG and the dimensions of the related dataset:

print(trainDAG@list.DAGs[[1]])
print(dim(trainDAG@list.observationsDAGs[[1]]))

Learning of a D2C object

Once we have a training set of DAGs, we can create a D2C object by storing, for (a subset of) the edges of each DAG, a vector of descriptors and the related binary label (0 or 1) indicating the existence of a causal link.

The parameters of the D2C descriptors are contained in a D2C.descriptor object:

descr.example<-new("D2C.descriptor",bivariate=FALSE,ns=5,acc=TRUE,lin=FALSE)

trainD2C<-new("D2C",sDAG=trainDAG, 
              descr=descr.example,ratioEdges=0.1,max.features=30,verbose=TRUE)


trainD2C<-makeModel(trainD2C,classifier="RF",EErep=2)

Let us print some properties of the trainD2C object:

print(dim(trainD2C@origX))
print(table(trainD2C@Y))

Creation of test data

Let us now create an independent test set (e.g. by creating a second simulatedDAG object with a different seed for the random number generator) and store it in the testDAG object:

NDAG.test=10


testDAG<-new("simulatedDAG",NDAG=NDAG.test, N=N, noNodes=noNodes,
             functionType = c("linear","quadratic","sigmoid"), 
             quantize=c(TRUE,FALSE),
             seed=101,sdn=sd.noise,verbose=TRUE)

Assessment of accuracy

In this section we assess the accuracy of D2C and compare it with two algorithms implemented in the bnlearn package. Accuracy is measured by the Balanced Error Rate (BER). For each dataset associated with the simulated DAGs contained in testDAG, we infer the directionality of the edges and compare it with the ground truth.
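As a reminder, the Balanced Error Rate averages the misclassification rates over the two classes (presence and absence of an edge). The following sketch is only meant to illustrate the measure; BER.sketch is a hypothetical name, while the BER function called in the code below is not redefined here.

BER.sketch<-function(Ytrue,Yhat){
  e1<-mean(Yhat[Ytrue==1]!=1)  ## error rate on the existing edges
  e0<-mean(Yhat[Ytrue==0]!=0)  ## error rate on the missing edges
  (e1+e0)/2
}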

if (!require(bnlearn)){
    install.packages("bnlearn", repos="http://cran.rstudio.com/")
    library(bnlearn)
}

Yhat.D2C<-NULL
Yhat.IAMB<-NULL
Yhat.GS<-NULL
Ytrue<-NULL

for (r in 1:testDAG@NDAG){
    set.seed(r)
    observedData<-testDAG@list.observationsDAGs[[r]]
    trueDAG<-testDAG@list.DAGs[[r]]

    ## inference of networks with bnlearn package
    Ahat.GS<-amat(gs(data.frame(observedData),alpha=0.01,max.sx=3))
    Ahat.IAMB<-(amat(iamb(data.frame(observedData),alpha=0.01,max.sx=3)))

    graphTRUE<- as.adjMAT(trueDAG)


    ## selection of a balanced subset of edges for the assessment
    Nodes=nodes(trueDAG)
    max.edges<-min(10,length(edgeList(trueDAG)))
    subset.edges = matrix(unlist(sample(edgeList(trueDAG),
                                        size = max.edges,replace = FALSE)),ncol=2,byrow = TRUE)
    subset.edges = rbind(subset.edges,t(replicate(n =max.edges,
                                                  sample(Nodes,size=2,replace = FALSE))))


    for(jj in 1:NROW(subset.edges)){
      i =as(subset.edges[jj,1],"numeric");
      j =as(subset.edges[jj,2],"numeric") ;
      pred.D2C = predict(trainD2C,i,j, observedData,rep=3)

      Yhat.D2C<-c(Yhat.D2C,as.numeric(pred.D2C$response))
      Yhat.IAMB<-c(Yhat.IAMB,Ahat.IAMB[i,j])
      Yhat.GS<-c(Yhat.GS,Ahat.GS[i,j])
      Ytrue<-c(Ytrue,graphTRUE[subset.edges[jj,1],subset.edges[jj,2]])
    }

    ## computation of the Balanced Error Rate on the edges tested so far
    BER.D2C<-BER(Ytrue,Yhat.D2C)
    BER.GS<-BER(Ytrue,Yhat.GS)
    BER.IAMB<-BER(Ytrue,Yhat.IAMB)

    cat("\n  DAG", r, ",", length(Ytrue), " edges tested: BER.D2C=",BER.D2C,
        "BER.IAMB=",BER.IAMB,"BER.GS=",BER.GS,"\n")
}

The average accuracy of the three algorithms (the smaller the BER the better) is shown by typing:

cat("\n BER.D2C=",BER.D2C, "BER.IAMB=",BER.IAMB,"BER.GS=",BER.GS,"\n")

We invite the user to assess the impact of the number of training samples (e.g. by increasing the value of NDAG and retraining D2C) on the accuracy of the network construction.
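For instance, one could retrain D2C on a larger set of simulated DAGs and repeat the assessment above. This is only a sketch reusing the parameters and calls already introduced; the value of NDAG.big is purely illustrative and the computation takes longer:

NDAG.big<-50
## larger number of DAGs for the training set (illustrative value)

trainDAG.big<-new("simulatedDAG",NDAG=NDAG.big, N=N, noNodes=noNodes,
                  functionType = c("linear","quadratic","sigmoid"),
                  seed=0,sdn=sd.noise,quantize=c(TRUE,FALSE),verbose=TRUE)

trainD2C.big<-new("D2C",sDAG=trainDAG.big,
                  descr=descr.example,ratioEdges=0.1,max.features=30,verbose=TRUE)
trainD2C.big<-makeModel(trainD2C.big,classifier="RF",EErep=2)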

Assessment of accuracy with the alarm dataset

Let us load the alarm dataset and focus on a highly connected subnetwork of 100 nodes:

data(alarm)

graphTRUE<-true.net
set.seed(0)

observedData<-dataset

## remove quasi-constant variables (standard deviation smaller than 0.1)
w.const<-which(apply(observedData,2,sd)<0.1)
if (length(w.const)>0){
  observedData<-observedData[,-w.const]
  graphTRUE<-graphTRUE[-w.const,-w.const]
}

## keep the 100 nodes with the highest degree (number of incoming plus outgoing edges)
indn<-sort(apply(graphTRUE,1,sum)+apply(graphTRUE,2,sum),
           decreasing=TRUE,index.return=TRUE)$ix[1:100]
observedData<-observedData[,indn]
graphTRUE<-graphTRUE[indn,indn]

Let us first perform the inference with two bnlearn algorithms,

n<-NCOL(observedData)


Ahat.GS<-amat(gs(data.frame(observedData)))
Ahat.IAMB<-amat(iamb(data.frame(observedData),alpha=0.05))

and then with D2C...

require(doParallel)
## the foreach loop below uses the sequential %do% operator; with a registered
## parallel backend (registerDoParallel) the %dopar% operator could be used instead
Yhat.D2C<-NULL
Yhat.GS<-NULL
Yhat.IAMB<-NULL
Ytrue<-NULL


for (i in 1:n){
  ## creation of a balanced test set: the children of node i in the true graph
  ## plus an equal number of nodes not linked to i
  ind1<-which(graphTRUE[i,]==1)
  ind0<-setdiff(setdiff(1:n,i),ind1)
  ind<-c(ind1,ind0[seq_along(ind1)])

  FF<-foreach(j=ind) %do%{
     list(Yhat=predict(trainD2C,i,j,observedData,rep=3)$response,
         Yhat2=Ahat.GS[i,j],Yhat3=Ahat.IAMB[i,j],Ytrue=graphTRUE[i,j])   
    }
  Yhat.D2C<-c(Yhat.D2C,unlist(lapply(FF,"[[",1) ))
  Yhat.GS<-c(Yhat.GS,unlist(lapply(FF,"[[",2)))
  Yhat.IAMB<-c(Yhat.IAMB,unlist(lapply(FF,"[[",3)))
  Ytrue<-c(Ytrue,unlist(lapply(FF,"[[",4)))
  }

You can now print out the balanced error rates (BER) of the three techniques by using:

cat("\n BER.D2C=",BER(Ytrue,Yhat.D2C), "BER.GS=",BER(Ytrue,Yhat.GS),
    "BER.IAMB=",BER(Ytrue,Yhat.IAMB),
    "\n") 

