makeblocks: Creates strata of size 'bsize' aiming to minimize within...

Description Usage Arguments Details Value Usage Tricks Challenge Author(s) References Examples

View source: R/makeblocks.R

Description

From a distance matrix of n points (or from a data set, in which case the distances are calculated using the smahal function) and block size, bsize, this function outputs a stratification of the points of blocks of size bsize.

Usage

1
2
makeblocks(distmat, bsize, Itr = 30, .data = NULL, vars, maxItr = 200, 
					verbose = 0, ...)

Arguments

distmat

A numeric distance matrix of size n by n. Can be created using smahal function.

bsize

The size of each block.

Itr

Number of iterations of the randomization algorithm to be run, (initially). See 'Usage Tricks' section below.

.data

An optional data frame. Recommended.

vars

A character vector of column names/indices to be used to create the distance matrix. Ignored, if distmat is provided. If not provided all variables of .data are used.

maxItr

Maximum number of iterations of the randomization algorithm to be run, (initially). See 'Usage Tricks' section below.

verbose

0, 1 or 2 for the amount to detail to be printed during the run of the function.

...

Additional arguments. May be ignored.

Details

At the minimum the function requires distmat and bsize. In this case it will return a blocking structure as a floor(n/bsize) by bsize matrix. Each entry of the matrix will be a unit assigned to that block.

The argument .data of the data indices is not required but recommended for convenience. If .data is provided but not 'distmat', the distance matrix will be created by a call to smahal function to .data[,vars]. If both distmat and .data are provided then distmat is used for blocking and the strata index, from 1 to floor(n/bsize), for the data points are added as an extra column to the data set. This makes the diagnostics process simpler, see examples below.

IMPORTANT NOTE: In order to perform matching, makeblocks requires the user to load the optmatch (>= 0.9-1) package separately. The manual loading is required due to software license issues. If the package is not loaded the makeblocks command will fail with an error saying the 'optmatch' package is not present. Reference to 'optmatch' is given below.

Value

A list consisting of the following elements

strata

A blocking structure as a floor(n/bsize) by bsize matrix. Each entry being the units in that block. The unit names are based on the row names of .data or the row and column names of distmat, or 1:nrow(distmat). The rownames of strata are the pivot elements of the blocks.

.data

If .data is provided, then the same data set is returned with an extra column 'strata'. .data$strata are the strata index for the corresponding data point. The data points not assigned to any strata is indexed with a negative number.

Usage Tricks

The function depends on an iterative algorithm. The iterative algorithm is usually pretty smart and can figure out the number of iterations required to get close to the optimal solution with high probability.

To make the function make all decisions on its own, set maxItr to a very high value, or Inf, and do not specify Itr. If the problem is easy, it should automatically stop much earlier than maxItr.

For a more interactive use of the function one can start with a relatively small value of Itr, e.g., 20 or 30, and then it should prompt how many more iterations would be needed to maximize get close to the optimal solution. The user can choose how many more iterations to run. Some run log will be printed to further help the user make this decision, for more detailed logs use verbose=1 or verbose=2.

To preset the number of iterations, say 200, provide Itr=200 and maxItr=200.

Since the algorithm is a randomization algorithm, the outcomes will not be reproducible unless the seed is set, see set.seed.

Challenge

The pdf version of this manual shows the math equations clearly; the html version may be difficult to read.

The problem of finding an optimal stratification is NP-hard, in general. This algorithm combines multiple heuristics to solve the following problem.

Let d is a n by n matrix of distances, i.e. d[ij] >= 0, d[ij]=d[ji] and d[ij] <= d[ik]+d[kj], for all i, j, k. Fix integer b. Find B[1], ..., B[floor(n/b)], floor(n/b) exclusive subsets of {1,...,n} of size b such that it solves the following optimization problem.

min_{B[1], …, B[\lfloor(n/b)\rfloor]; B[l]\cap B[l']=\emptyset; |B[l]|=b; for all l,l'} ∑_{l=1}^{floor(n/b)} ∑_{i,j\in B[l]} d[ij].

The challenge has two versions.

(1)

Construct another algorithm that beats this algorithm across multiple problems, decided by n, bsize. Pathological examples can be insightful in this challenge.

(2)

Provide theoretical results. If you set maxItr={n \choose \lfloor(n/b)\rfloor} the solution will be a 2-approximation (with a small and fixable caveat). The heuristic algorithm believes that it can do not too bad with much less computation. How much worse is this heuristic algorithm? What would be a decent compromise, the number of iterations Itr, to balance computational burden and approximation error? Answers to these questions are useful both universally and in average cases.

Author(s)

Bikram Karmakar

References

Hansen, B. B. and Klopfer, S. O. (2006) Optimal full matching and related designs via network flows, JCGS 15 609–627.

Karmakar, B., Small, D. S. and Rosenbaum, P. R. (2018). Reinforced designs in observational studies of treatment effects: Multiple instruments plus controls as evidence factors.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
## IN THESE EXAMPLES YOU NEED TO LOAD THE R PACKAGE 
#  'optmatch'.  OTHERWISE, THEY WILL RUN IN A CODE CHECK MODE.
# library(optmatch)
###### 		Usage 1		##########
data(nh0506)

res <- makeblocks(nh0506, bsize=4, maxItr=15, 
         vars=c("female","age","black","hispanic","education","povertyr"))

		
## Not run: 
###### Some diagnostics
v = "age"
## Boxplots
temp = res$.data
temp[[paste0(v,'.instrata')]] <- temp[[v]] 
meanbystrat <- aggregate(temp[[v]], by =list(strata=temp$strata), mean)
for(b in unique(temp$strata)){
     who <- which( temp$strata == b )
     temp[[paste0(v,'.instrata')]][who] <- temp[[paste0(v,'.instrata')]][who] 
                        - meanbystrat[meanbystrat[,1]==b,2]
}

temp[[paste0(v,'.instrata')]] = temp[[paste0(v,'.instrata')]] + mean(temp[[v]])

boxplot(v~x,
data = data.frame(v=c(temp[[v]],temp[[paste0(v,'.instrata')]][temp$strata>0]), 
               x=c(rep('prestratification', nrow(temp)), 
                        rep('poststratification', sum(temp$strata>0)))))

## Anova
summary(aov(age~factor(strata), data=res$.data[res$.data$strata>0,]))

## End(Not run)


###### 		Usage 2		##########
distmat <- smahal(nh0506[,c("female","age","black","hispanic",
								"education","povertyr")])
res <- makeblocks(distmat, .data=nh0506, bsize=4, maxItr=15)


## Not run: 
# Other usages
## Internally calls 'smahal' to create the distances.
res <- makeblocks(nh0506[,c("female","age","black","hispanic",
                                           "education","povertyr")], 4)

## Returns the blocking structure in a matrix form, if data is not provided.
distmat <- smahal(nh0506[,c("female","age","black",
                                  "hispanic","education","povertyr")])
res <- makeblocks(distmat, bsize=4, maxItr=20)

## End(Not run)


###### 		Usage 3		##########
data(wls)

## Not run: 
data(wls)
library(optmatch)
wls4match <- wls

## This code replicates the blocking algorithm used in the paper
##	Karmakar, Small, and Rosenbaum (2018).

## Create the distance matrix

distmat1 <- smahal(wls4match[,"gwiiq_bm"]) 				## IQ

## Father's and mother's edu and parent's income
distmat2 <- smahal(wls4match[,c("edfa57q.NoNA", "edmo57q.NoNA", "bmpin1.NoNA",	
              ## Father's and mother's edu and parent's income
              "incg400", "incg250")])				
              ## Indicators for income in the top 5
			  
## occupation score
distmat2.2 <- smahal(wls4match[,c("ocsf57.NoNA", "ocpf57.NoNA")])		

## missing indicators
distmat3 <- smahal(wls4match[,c("edfa57q.miss", "edmo57q.miss", 	
              "bmpin1.miss", "ocsf57.miss", "ocpf57.miss")])

## The IQ = gwiiq_bm is given more weight.
##  parents' education and parent's income 
distmat = distmat1*10+6*distmat2+3*distmat2.2+2*distmat3

## creating the blocks.  This can take about 30min to run.
##	May take more time depending of the computation power of the system.
set.seed(0841)
res.20.2 = makeblocks(distmat, bsize=25, Itr=250, maxItr=250, .data=wls4match)

## End(Not run)

blockingChallenge documentation built on Sept. 22, 2018, 1:05 a.m.