A function to learn a DAG model by the hill climbing algorithm

Description

This function is to efficiently implement the hill climbing algorithm for learning a DAG by optimizing a score. It can also build an emsemble of DAGs based on bootstrap resamples of the data.

Usage

1
2
3
4
score(Y, n.boot=0, score.type="BIC", threshold=0, max.step=500,  ini.adj.matrix=NULL, 
blacklist=NULL, whitelist=NULL, standardize=TRUE,  standardize.boot=TRUE, 
random.forest=FALSE, random.step.length=NULL, 
nrestart=0, perturb=0, shuffle=FALSE, print=FALSE, EPS=1e-06)

Arguments

Y

an n by p data matrix: n – sample size, p – number of variables

n.boot

an integer: the number of bootstrap resamples of the data matrix Y, default = 0, meaning no bootstrapping

score.type

a string: "BIC" or "likelihood", default = "BIC"

threshold

a nonnegative scalar: the cutoff value for the change of the score to decide whether to stop the search; default = 0, meaning stop search when score is not improved

max.step

an integer: the maximum number of search steps of the hill climbing algorithm, default = 500

ini.adj.matrix

a p by p 0-1 matrix: the initial graph, default = NULL, meaning the empty graph

blacklist

a p by p 0-1 matrix: if the (i,j)th-entry is "1", then the edge i–>j will be excluded from the DAG during the search, default= NULL

whitelist

a p by p 0-1 matrix: if the (i,j)th-entry is "1", then the edge i–>j will always be included in the DAG during the search, default=NULL

standardize

logic: whether to standardize the data to have mean zero and sd one, default = TRUE

standardize.boot

logic: whether to standardize the bootstrap resamples, default = TRUE

random.forest

logic: whether to use the "random forest" idea for further variance reduction, default=FALSE

random.step.length

a vector: specify “random forest" steps

nrestart

an integer: number of times to restart the search algorithm after a local optimal is achieved. The purpose is to search for global optimal, default=0, meaning no restart.

perturb

an integer: how many random addition/deletion/reversal operations should be used in each random restart, default = 0, corresponding to no restart.

shuffle

logic: whether to shuffle the order of variables before DAG learning. The purpose is to avoid potential systematic biases in simulation studies (due to possible coincidence of the topological ordering and the nominal ordering of the variables), default=FALSE

print

logic: whether print the step information, default= FALSE

EPS

a scalar: a number to indicate what small values will be treated as zero, default = 1e-06, meaning that values with magnitude smaller than 1e-6 will be treated as zero.

Details

score implements the hill climbing algorithm. It can be used to build an ensemble of DAGs (in form of adjacency matrices) based on bootstrap resamples of the data.

Value

If n.boot=0 (no bootstrap), then a list with five components:

adj.matrix

adjacency matrix for the estimated DAG

final.step

a number recording how many search steps are conducted before the procedure stops

movement

a matrix recording the selected operation, \ addition, deletion or reversal of an edge, at each search step

delta.min

a vector recording the change of score due to the selected operation at each search step

bic.ok

a vector recording whether there is error in score calculation at each step

If n.boot>0 (with bootstrap) then a list with one component:

adj.matrix

adjacency matrices for the estimated DAGs based on n.boot bootstrap resamples

Author(s)

Wang, R and Peng, J.

References

Wang, R. and Peng, J. (2013) Learning directed acyclic graphs via bootstrap aggregating. arXiv:1406.2098

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
  data(example)
  Y.n=example$Y  # data matrix
  true.dir=example$true.dir  # adjacency matrix of the data generating DAG
  true.ske=example$true.ske  # skeleton graph of the data generating DAG
  
  
  temp=score(Y=Y.n, n.boot=0, score.type="BIC") ## learn DAG using "BIC" score
  adj=temp$adj.matrix
  
  sum(adj)  # number of edges 
  
  
  # Find the total number of skeleton edges and the number of correct skeleton edges 
  # from the adjacency matrix  of an estimated DAG.
  # Then compared with the true.ske.
  
    diag(adj)=0
    tt=adj+t(adj) ##symmetrization
    correct.c=sum((tt>0)&(true.ske>0))/2    
    total.c=sum(tt>0)/2
    total.true=sum(true.ske>0)/2
    c(total.c, correct.c, total.true)

 # Build an ensemble of DAGs based on bootstrap resamples of the data 
  temp.boot=score(Y.n, n.boot=10, score.type="BIC")
  boot.adj=temp.boot$adj.matrix