Description Usage Arguments Details Value Usage Tricks Challenge Author(s) References Examples
From a distance matrix of n
points (or from a data set, in
which case the distances are calculated using the smahal
function) and
block size, bsize
, this function outputs a stratification of the points
of blocks of size bsize
.
1 2 | makeblocks(distmat, bsize, Itr = 30, .data = NULL, vars, maxItr = 200,
verbose = 0, ...)
|
distmat |
A numeric distance matrix of size |
bsize |
The size of each block. |
Itr |
Number of iterations of the randomization algorithm to be run, (initially). See 'Usage Tricks' section below. |
.data |
An optional data frame. Recommended. |
vars |
A character vector of column names/indices to be used to create the distance
matrix. Ignored, if |
maxItr |
Maximum number of iterations of the randomization algorithm to be run, (initially). See 'Usage Tricks' section below. |
verbose |
0, 1 or 2 for the amount to detail to be printed during the run of the function. |
... |
Additional arguments. May be ignored. |
At the minimum the function requires distmat
and bsize
. In
this case it will return a blocking structure as a floor(n/bsize)
by bsize
matrix. Each entry of the matrix will be a unit assigned
to that block.
The argument .data of the data indices is not required but recommended for
convenience. If .data is provided but not 'distmat', the distance matrix will be
created by a call to smahal
function to .data[,vars]
. If both
distmat
and .data
are provided then distmat
is used for
blocking and the strata index, from 1 to floor(n/bsize)
, for the data
points are added as an extra column to the data set. This makes the
diagnostics process simpler, see examples below.
IMPORTANT NOTE: In order to perform matching, makeblocks
requires the
user to load the optmatch (>= 0.9-1) package separately. The manual loading is
required due to software license issues. If the package is not loaded the
makeblocks
command will fail with an error saying the 'optmatch' package
is not present. Reference to 'optmatch' is given below.
A list consisting of the following elements
strata |
A blocking structure as a |
.data |
If |
The function depends on an iterative algorithm. The iterative algorithm is usually pretty smart and can figure out the number of iterations required to get close to the optimal solution with high probability.
To make the function make all decisions
on its own, set maxItr
to a very high value, or Inf, and do not specify Itr
.
If the problem is easy, it should automatically stop much earlier than maxItr
.
For a more interactive use of the function one can start with a relatively
small value of Itr
, e.g., 20 or 30, and then it should prompt how many more
iterations would be needed to maximize get close to the optimal solution. The
user can choose how many more iterations to run. Some run log will be printed
to further help the user make this decision, for more detailed logs use
verbose=1
or verbose=2
.
To preset the number of iterations, say 200, provide Itr=200
and
maxItr=200
.
Since the algorithm is a randomization algorithm, the outcomes will not be
reproducible unless the seed is set, see set.seed
.
The pdf version of this manual shows the math equations clearly; the html version may be difficult to read.
The problem of finding an optimal stratification is NP-hard, in general. This algorithm combines multiple heuristics to solve the following problem.
Let d is a n
by n
matrix of distances, i.e. d[ij] >= 0
,
d[ij]=d[ji]
and d[ij] <= d[ik]+d[kj]
, for all i, j, k
. Fix
integer b
. Find B[1], ..., B[floor(n/b)]
, floor(n/b)
exclusive subsets of {1,...,n}
of size b
such that it solves
the following optimization problem.
min_{B[1], …, B[\lfloor(n/b)\rfloor]; B[l]\cap B[l']=\emptyset; |B[l]|=b; for all l,l'} ∑_{l=1}^{floor(n/b)} ∑_{i,j\in B[l]} d[ij].
The challenge has two versions.
Construct another algorithm that beats this algorithm across multiple problems,
decided by n
, bsize
. Pathological examples can be insightful in
this challenge.
Provide theoretical results. If you set maxItr=
{n \choose \lfloor(n/b)\rfloor}
the solution will be a 2-approximation (with a small and fixable caveat). The
heuristic algorithm believes that it can do not too bad with much less
computation. How much worse is this heuristic algorithm? What would be a decent
compromise, the number of iterations Itr
, to balance computational burden
and approximation error? Answers to these questions are useful both universally
and in average cases.
Bikram Karmakar
Hansen, B. B. and Klopfer, S. O. (2006) Optimal full matching and related designs via network flows, JCGS 15 609–627.
Karmakar, B., Small, D. S. and Rosenbaum, P. R. (2018). Reinforced designs in observational studies of treatment effects: Multiple instruments plus controls as evidence factors.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | ## IN THESE EXAMPLES YOU NEED TO LOAD THE R PACKAGE
# 'optmatch'. OTHERWISE, THEY WILL RUN IN A CODE CHECK MODE.
# library(optmatch)
###### Usage 1 ##########
data(nh0506)
res <- makeblocks(nh0506, bsize=4, maxItr=15,
vars=c("female","age","black","hispanic","education","povertyr"))
## Not run:
###### Some diagnostics
v = "age"
## Boxplots
temp = res$.data
temp[[paste0(v,'.instrata')]] <- temp[[v]]
meanbystrat <- aggregate(temp[[v]], by =list(strata=temp$strata), mean)
for(b in unique(temp$strata)){
who <- which( temp$strata == b )
temp[[paste0(v,'.instrata')]][who] <- temp[[paste0(v,'.instrata')]][who]
- meanbystrat[meanbystrat[,1]==b,2]
}
temp[[paste0(v,'.instrata')]] = temp[[paste0(v,'.instrata')]] + mean(temp[[v]])
boxplot(v~x,
data = data.frame(v=c(temp[[v]],temp[[paste0(v,'.instrata')]][temp$strata>0]),
x=c(rep('prestratification', nrow(temp)),
rep('poststratification', sum(temp$strata>0)))))
## Anova
summary(aov(age~factor(strata), data=res$.data[res$.data$strata>0,]))
## End(Not run)
###### Usage 2 ##########
distmat <- smahal(nh0506[,c("female","age","black","hispanic",
"education","povertyr")])
res <- makeblocks(distmat, .data=nh0506, bsize=4, maxItr=15)
## Not run:
# Other usages
## Internally calls 'smahal' to create the distances.
res <- makeblocks(nh0506[,c("female","age","black","hispanic",
"education","povertyr")], 4)
## Returns the blocking structure in a matrix form, if data is not provided.
distmat <- smahal(nh0506[,c("female","age","black",
"hispanic","education","povertyr")])
res <- makeblocks(distmat, bsize=4, maxItr=20)
## End(Not run)
###### Usage 3 ##########
data(wls)
## Not run:
data(wls)
library(optmatch)
wls4match <- wls
## This code replicates the blocking algorithm used in the paper
## Karmakar, Small, and Rosenbaum (2018).
## Create the distance matrix
distmat1 <- smahal(wls4match[,"gwiiq_bm"]) ## IQ
## Father's and mother's edu and parent's income
distmat2 <- smahal(wls4match[,c("edfa57q.NoNA", "edmo57q.NoNA", "bmpin1.NoNA",
## Father's and mother's edu and parent's income
"incg400", "incg250")])
## Indicators for income in the top 5
## occupation score
distmat2.2 <- smahal(wls4match[,c("ocsf57.NoNA", "ocpf57.NoNA")])
## missing indicators
distmat3 <- smahal(wls4match[,c("edfa57q.miss", "edmo57q.miss",
"bmpin1.miss", "ocsf57.miss", "ocpf57.miss")])
## The IQ = gwiiq_bm is given more weight.
## parents' education and parent's income
distmat = distmat1*10+6*distmat2+3*distmat2.2+2*distmat3
## creating the blocks. This can take about 30min to run.
## May take more time depending of the computation power of the system.
set.seed(0841)
res.20.2 = makeblocks(distmat, bsize=25, Itr=250, maxItr=250, .data=wls4match)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.