Parsing POS Layout Files for Descriptive Names

This vignette shows how to parse a Provider of Services (POS) layout file to extract the descriptive variable names it contains. POS datasets from 2010 and earlier have generic variable names such as PROV0001, PROV0002, ... that give no indication of what each variable represents. Alongside a data dictionary explaining each variable's values, the layout file also lists a COBOL descriptive name for each variable. The pos_names_extract() function parses this file and returns the descriptive names in the same order as the variables in the dataset.

Provider of Services Data

I have included a sample of the 2010 Provider of Services data for hospices. The full 2010 file (along with many other years) is available from the NBER and contains data from other provider types as well.

library(medicare)
# load the package data
data(pos2010, package = "medicare")
names(pos2010)[1:10]

These variable names are uninformative, and with over 500 variables it is impractical to look each one up by hand. Instead, we can parse the layout file to obtain useful names. In this example, I have bundled the 2010 layout file with this package, but I expect the user to have downloaded the layout text file that corresponds to each dataset in use.

# filepath should be changed by user
filepath <- system.file("extdata", "layout10.txt", package = "medicare")
names_2010 <- pos_names_extract(filepath, pos2010)
names_2010[1:10]

These are much more descriptive variable names and worth using.

pos2010_renamed <- pos2010
names(pos2010_renamed) <- names_2010

Note that it is up to the user to make sure that the layout file is appropriate for the chosen data file. Each year's layout file is different, so each year must be parsed separately. The function checks two things: that the number of variables in the layout file matches the number in the dataset, and that the generic variable names are the same in both. It stops with an error if either check fails. If the generic names in the dataset for year 20XX happen to be the same as those in the layout for year 20YY, the parsing will run, but the descriptive names will not necessarily be accurate, because CMS is not fully consistent with variable naming across years.

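# Fewer columns than the layout describes: the variable-count check fails
pos2010_short <- pos2010[, 1:500]
names_2010_short <- try(pos_names_extract(filepath, pos2010_short))

# Generic names that differ from the layout: the name check fails
pos2010_wrong_names <- pos2010
names(pos2010_wrong_names)[1:3] <- c("wrong1", "wrong2", "wrong3")
names_2010_wrong_names <- try(pos_names_extract(filepath, pos2010_wrong_names))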
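
To make the two checks described above concrete, here is a minimal sketch of how such validation could look. It is not the package's actual implementation: check_layout_match() is a hypothetical helper, the name vectors are made up, and the case-insensitive comparison is an assumption of this sketch.

```r
# Hypothetical helper, NOT part of the medicare package. It mimics the two
# checks described in the text on plain character vectors.
check_layout_match <- function(layout_generic, data_names) {
  # Check 1: the layout and the dataset describe the same number of variables
  if (length(layout_generic) != length(data_names)) {
    stop("layout has ", length(layout_generic),
         " variables but dataset has ", length(data_names))
  }
  # Check 2: generic names agree position by position
  # (ignoring case is an assumption made for this sketch)
  mismatched <- which(toupper(layout_generic) != toupper(data_names))
  if (length(mismatched) > 0) {
    stop("generic names differ at position(s): ",
         paste(mismatched, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up generic names standing in for those read from a layout file
layout_generic <- c("PROV0001", "PROV0002", "PROV0003")

# Matching names: passes silently
check_layout_match(layout_generic, c("prov0001", "prov0002", "prov0003"))

# Both calls below fail; try() captures the error so the script continues
short_result <- try(
  check_layout_match(layout_generic, c("prov0001", "prov0002")),
  silent = TRUE
)
wrong_result <- try(
  check_layout_match(layout_generic, c("wrong1", "prov0002", "prov0003")),
  silent = TRUE
)
```

Neither check can detect the case where the generic names match but the layout belongs to a different year, which is why choosing the right layout file remains the user's responsibility.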

Pre-compiled dataset names

To save the user the time and headaches of downloading each year's layout file, I have pre-compiled dataset names for years 2000-2010. These can be accessed via the pos_names() function. Printing a few variables from the middle of each year's names also illustrates how the dataset layouts change over time:

for (year in 2000:2010) {
  print(year)
  print(pos_names(year)[200:205])
}
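
Beyond printing a slice for each year, the differences between two years can be summarized directly. The sketch below uses two made-up name vectors in place of real pos_names() output so that it runs on its own. The variable names in it are invented for illustration and are not actual POS names.

```r
# Made-up name vectors standing in for the descriptive names of two years,
# e.g. what pos_names(2009) and pos_names(2010) might be assigned to
names_a <- c("PROVIDER_NUMBER", "CITY", "STATE", "BED_COUNT")
names_b <- c("PROVIDER_NUMBER", "STATE", "CITY", "BED_COUNT", "FAX_NUMBER")

# Names present in one year but not the other
added   <- setdiff(names_b, names_a)
dropped <- setdiff(names_a, names_b)

# Names present in both years whose column position changed
shared <- intersect(names_a, names_b)
moved  <- shared[match(shared, names_a) != match(shared, names_b)]

print(added)
print(dropped)
print(moved)
```

Replacing names_a and names_b with the results of pos_names() for two years should give the same kind of summary for the real layouts, assuming no descriptive name is repeated within a year.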



medicare documentation built on May 1, 2019, 10:19 p.m.