build.panel: Build PSID panel data set

Description

Builds a panel data set in wide format with id variables pid (unique person identifier) and year from individual PSID family files.

Usage

build.panel(datadir = NULL, fam.vars, ind.vars = NULL, SAScii = FALSE,
  heads.only = FALSE, sample = NULL, design = "balanced",
  verbose = FALSE)

Arguments

datadir

Either NULL, in which case data are saved to tmpdir, or the path to a directory containing the family files ("FAMyyyy.xyz") and the individual file ("IND2009ER.xyz") in an admissible format .xyz. Admissible formats are .dta, .csv, .RData and .rda. Please follow the naming convention. Only .dta version <= 12 is supported. Recommended usage is to specify datadir.

fam.vars

data.frame of variables to retrieve from family files. See the example for the required format.

ind.vars

data.frame of variables to get from the individual file. In almost all cases this will be the type of survey weights you want to use. Do not include the id variables ER30001 and ER30002.

SAScii

logical. TRUE if you want to download the data directly into Rda format (no dependency on Stata/SAS/SPSS). This may take a long time, but the download happens only once if you specify datadir.

heads.only

logical TRUE if user wants current household heads only.

sample

string indicating which sample to select: "SRC" (Survey Research Center), "SEO" (Survey of Economic Opportunity), "immigrant" (immigrant sample), "latino" (Latino family sample). Defaults to NULL, so no subsetting takes place.

design

Either a character string, "balanced" or "all", or an integer. "balanced" means only individuals who appear in every wave are kept. "all" means all individuals are kept. An integer value gives the minimum number of consecutive years of participation, e.g. design=3 means present in at least 3 consecutive waves.

verbose

logical TRUE if you want verbose output.
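
As an illustration of the three design options, the following sketch applies the same selection rules to a small made-up panel using data.table. The table toy, the column wave and the helper longest.run are hypothetical names introduced here. This is not the code build.panel runs internally, only one way to read the definitions of "balanced", "all" and an integer design.

```r
library(data.table)

# Hypothetical toy panel in long format (not PSID data):
# pid 1 is observed in all three waves, pid 2 skips 1986,
# pid 3 appears once, pid 4 appears in the last two waves.
toy <- data.table(pid  = c(1, 1, 1, 2, 2, 3, 4, 4),
                  year = c(1985, 1986, 1987, 1985, 1987, 1986, 1986, 1987))

# Index waves by position so that "consecutive" means adjacent waves.
toy[, wave := match(year, sort(unique(year)))]
n.waves <- uniqueN(toy$wave)

# Hypothetical helper: length of the longest run of adjacent waves.
longest.run <- function(w) {
  w <- sort(unique(w))
  run.id <- cumsum(c(TRUE, diff(w) != 1))  # new run whenever a wave is skipped
  max(tabulate(run.id))
}

# design = "balanced": keep individuals present in every wave.
balanced <- toy[, if (.N == n.waves) .SD, by = pid]

# design = 2: keep individuals with at least 2 consecutive waves.
min2 <- toy[, if (longest.run(wave) >= 2) .SD, by = pid]

# design = "all": keep everyone.
all.obs <- copy(toy)

unique(balanced$pid)
unique(min2$pid)
```

Under these rules only pid 1 survives the balanced design, while pids 1 and 4 satisfy design = 2.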

Details

Takes the desired variables from the family files for the specified years in folder datadir and merges them using the id information in IND2013ER.xyz, which must be in the same directory. The raw data can be supplied in Stata .dta format, or it can be downloaded directly from the PSID server to datadir or tmpdir. Note that currently only Stata format <= 12 is supported (so use saveold in Stata). Accepted input formats are Stata .dta, .csv files, and the R data formats .rda and .RData.

The user can change subsetting criteria as well as sample designs. Variables that are missing in certain waves are accounted for automatically: they are inserted in the missing year as NA.

Merge: the interview number variable in each family file maps to the interview number variable of the corresponding year in the individual file. Run example(build.panel) for a demonstration. Similar in usage to the Stata module psiduse.
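
The merge step can be sketched on made-up data. All table and column names here (fam85, ind, interview, interview85) are hypothetical, and the pid formula ID1968 * 1000 + pernum is the common PSID convention, assumed here rather than confirmed by this page. The internals of build.panel may differ.

```r
library(data.table)

# Hypothetical 1985 family file: one row per family interview.
fam85 <- data.table(interview = c(101, 102),
                    Money85   = c(30000, 45000))

# Hypothetical individual file: one row per person, holding the
# interview number of that person's family in 1985.
ind <- data.table(ID1968      = c(1, 1, 2),
                  pernum      = c(1, 2, 1),
                  interview85 = c(101, 101, 102))

# Map the family-file interview number to the individual file's
# interview number for the same year.
m <- merge(ind, fam85, by.x = "interview85", by.y = "interview")

# Assumed convention for the unique person identifier.
m[, pid := ID1968 * 1000 + pernum]
m[, year := 1985]

# A variable that is missing in a wave is filled in as NA
# (age85 is a hypothetical variable absent from fam85).
m[, age85 := NA_real_]

m[, list(pid, year, Money85, age85)]
```

Both members of family 101 receive the same family-level value of Money85, which is the expected result of attaching family data to individuals.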

Value

data

The resulting data.table. The variable pid is the unique person identifier, constructed from ID1968 and pernum.

dict

Data dictionary if Stata data was supplied, NULL otherwise.

Examples

## Not run: 
# ################################################
# Real-world example: not run because takes long.
# Build panel with income, wage, age and education
# ################################################

library(psidR)
library(data.table)  # provides fread, setkey, J and copy

r = system.file(package="psidR")
f = fread(file.path(r,"psid-lists","famvars.txt"))
i = fread(file.path(r,"psid-lists","indvars.txt"))

f[1:38,vgroup := "wage"]
f[39:76,vgroup := "earnings"]
setkey(f,vgroup)

i[1:38, vgroup := "age"]
i[39:76, vgroup := "educ"]  # caution about 2 first years: no educ data
i[77:114, vgroup := "weight"]
setkey(i,vgroup)

ind = cbind(i[J("age"),list(year,age=variable)],
            i[J("educ"),list(educ=variable)],
            i[J("weight"),list(weight=variable)])
fam = cbind(f[J("wage"),list(year,wage=variable)],
            f[J("earnings"),list(earnings=variable)])

# caution: this step will take many hours
d = build.panel(datadir="~/data",
                fam.vars=fam,
                ind.vars=ind,
                SAScii = TRUE,
                heads.only = TRUE,
                sample="SRC",
                design=2)

## End(Not run) 

# ######################################
# reproducible example on artificial data.
# run this with example(build.panel).
# ######################################

## make reproducible family data sets for 2 years
## variables are: family income (Money) and age

## Data acquisition step: you download data or
## run build.panel with SAScii=TRUE

# testPSID creates artificial PSID data
td <- testPSID(N=12,N.attr=0)
fam1985 <- copy(td$famvars1985)
fam1986 <- copy(td$famvars1986)
IND2009ER <- copy(td$IND2009ER)

# create a temporary datadir
my.dir <- tempdir()
# save those in the datadir
# notice different R formats admissible
save(fam1985,file=file.path(my.dir,"FAM1985ER.rda"))
save(fam1986,file=file.path(my.dir,"FAM1986ER.RData"))
save(IND2009ER,file=file.path(my.dir,"IND2009ER.RData"))

## end Data acquisition step.

# now define which famvars
famvars <- data.frame(year=c(1985,1986),
                      money=c("Money85","Money86"),
                      age=c("age85","age86"))

# create ind.vars
indvars <- data.frame(year=c(1985,1986),ind.weight=c("ER30497","ER30534"))

# call the builder
# data will contain column "relation.head" holding the relationship code.

d <- build.panel(datadir=my.dir,fam.vars=famvars,
                 ind.vars=indvars,
                 heads.only=FALSE,verbose=TRUE)

# see what happens if we drop non-heads
# only the ones who are heads in BOTH years 
# are present (since design='balanced' by default)
d <- build.panel(datadir=my.dir,fam.vars=famvars,
                 ind.vars=indvars,
                 heads.only=TRUE,verbose=FALSE)
# nrows=Inf prints every row of the data.table
print(d$data[order(pid)],nrows=Inf)

# change sample design to "all": 
# we'll keep individuals if they are head in one year,
# and drop in the other
d <- build.panel(datadir=my.dir,fam.vars=famvars,
                 ind.vars=indvars,heads.only=TRUE,
                 verbose=FALSE,design="all")
print(d$data[order(pid)],nrows=Inf)

# clean up the temporary files
file.remove(file.path(my.dir,"FAM1985ER.rda"),
            file.path(my.dir,"FAM1986ER.RData"),
            file.path(my.dir,"IND2009ER.RData"))

# END psidR example

# #####################################################################
# Please go to https://github.com/floswald/psidR for more example usage
# #####################################################################