importSAS: importSAS

View source: R/importSAS.R

importSASR Documentation

importSAS

Description

Selective import of sas7bdat files into R data.table format

Usage

importSAS(filename,wd=NULL,keep=NULL,drop = NULL,
                 where = NULL,obs = NULL,filter = NULL,
                 filter.by = NULL, filter.negative = FALSE,
                 set.hook=NULL,step.hook=NULL,pre.hook=NULL,
                 post.hook=NULL,savefile = NULL,overwrite = TRUE,
                 show.sas.code=FALSE,save.tmp = FALSE,content=FALSE,
                 na.strings="dot",date.vars=NULL,datetime.vars=NULL,
                 character.vars="pnr", numeric.vars = NULL,sas.program,
                 sas.switches, sas.runner, use.colClasses=TRUE,
                 skip.date.conversion=FALSE, force.numeric=TRUE,
                 sas.data.extension="sas7bdat", verbose=FALSE,...)
       contentSAS(filename,wd=NULL)

Arguments

filename

The filename (with full path) of the SAS dataset to import. So, "x:/data/rawdata/701111/lmdb.sas7bdat" and also "v:/data/workdata/701111/MeMe/project1/data/mydata.sas7bdat" work but "./data/mydata.sas7bdat" does not even not if the working directory is set to "v:/data/workdata/701111/MeMe/project1/".

wd

The directory used to store temporarily created files (SAS script, log file, csv file). You need to have permission to write to this directory. The default value is the current working directory, see getwd(), (which you may not have access to write to!). On Gentofte's Danmark Statistics servers it may help to set the working directory to the fast X drive.

keep

A vector of variable names, i.e., the variables (columns) to read and keep from the dataset. Default is to read and keep all variables.

drop

Specifies the variables (columns) to leave out from the dataset. Default is to leave out no variables.

where

Specifies which conditions the observations (rows) from the dataset should fulfil. Default is no conditions. Use SAS syntax (see examples).

obs

Number of observations to read from the dataset. Setting this to Inf has the same effect as not setting it, i.e, read all observations.

filter

Alternative or in addition to the where statement it is possible to filter the rows of filename based on a data.table. E.g., filter can be a data.table with one column consisting of *unique* PNRs to specify that only the matching rows should be imported from filename.

filter.by

Vector of variable names to filter by. By default all variables present in the filter data are used.

filter.negative

Vector of length two that defines how to merge the filter data with the SAS dataset filename:

  • c(1,1) means observation are only included if they are present both the filter data and the SAS data.

  • c(-1,1) means observation are only included if they are not present filter data but present in the SAS data.

set.hook

Quoted SAS statments (within use single quotes) to be placed in addition to set options (where, keep, drop, obs) when setting the data set filename. See examples.

step.hook

Quoted SAS statments (within use single quotes) to be placed after setting the data set filename. See examples.

pre.hook

Quoted SAS code (within use single quotes) to be set in the beginning of the SAS program. For example, it maybe useful to specify options such as 'options obs=100;' in combination with a where statement.

post.hook

Quoted SAS code (within use single quotes) to be set at the end of the SAS program. For example, it maybe useful to specify 'proc print data=df;' in case of trouble.

savefile

If specified, the generated csv file will be saved with the given name. The name should end with ".csv". The file will be saved in the working directory or in the directory given under wd if this is specified.

overwrite

Logical. Determines whether or not to overwrite files already existing with the same name as files generated by this function. This is the temporary SAS file, log file and csv file, and possibly a permanent file with the name given under "savefile". If the value is FALSE and some files already exist, the function will abort and print the name of the problematic files. The default value is TRUE.

show.sas.code

Logical. If TRUE show sas code in R console before running it.

save.tmp

Logical. Option to save all temporary files. Even though this is set to FALSE, the csv file will be saved if there is given a filename in "savefile". Default value is FALSE.

content

Logical. If true, the function will only read and return the content of the import file. Together with save.tmp=TRUE, this can be used to generate the SAS file without running it.

na.strings

A vector of strings to interpret values of character variables as NA. Each element should be a regular expression that can be understood by grepl. For example, the default value "^\.$" matches fields that contain a single dot and nothing else than a dot.

date.vars

Vector of variables to be converted to date variables. For these variables a SAS format statement yymmdd10. is used to force the correct order of year months and days and the conversion is done with lubridate::ymd. Conversion can be skipped with argument skip.date.conversion.

datetime.vars

Vector of variables to be converted to datetime variables. For these variables a SAS format statement datetime. is used and the conversion is done with lubridate::dmy_hms. Conversions in both SAS and R programs can be skipped with argument skip.date.conversion.

character.vars

names of variables that should be converted to character. Case does not matter. Default is "pnr".

numeric.vars

character.vars names of variables that should be converted to numeric. case does not matter.

sas.program

sas program. On linux where .Platform$OS.type=="unix" this defaults to "sas" on any other system to "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe"

sas.switches

On linux this defaults to "" on any other system to "-batch -nosplash -noenhancededitor -sysin"

sas.runner

How sas is invoked. On linux this defaults to "system" on any other system to "shell".

use.colClasses

Logical. If TRUE learn about the variable types from SAS's proc contents and pass to fread as argument colClasses.

skip.date.conversion

if TRUE do not try to convert any dates or datetime variables. If "SAS" only skip the format statements but format via lubridate (see date.vars and datetime.vars). If "R" skip lubridate conversion but keep the format statements.

force.numeric

if TRUE force numeric format on numeric variables specified by argument numeric.vars.

sas.data.extension

String to be checked against the file extenstion of filename. Default is "sas7bdat".

verbose

Logical. When TRUE warnings and errors are shown, otherwise not shown. This can be useful to turn on when there are problems with the result.

...

Arguments passed to fread for reading the created .csv file. OBS: try to avoid specifying colClasses and instead use the arguments of importSAS: character.vars, date.vars, datetime.vars and numeric.vars.

Details

This function first runs SAS proc contents in the background and uses the result to write temporary SAS code that writes the data in csv format to a temporary file. The data are then read with data.table::freads into R. The function tries to help with the formatting of numeric and date variables and with the understanding of missing values. However, in some cases this help is counterproductive and can therefore be turned off, to some extent at least, with arguments na.strings, force.numeric, skip.date.conversion. Also, the user selects which columns and rows to import. See examples.

As R is case-sensitive while SAS is not, to avoid confusion all variable names are converted to lower case.

Value

The output is a data.table with the columns requested in keep (or all columns) and the rows requested in where (or all rows) up to obs many rows.

Author(s)

Anders Munch a.munch@sund.ku.dk and Thomas A Gerds tag@biostat.ku.dk

References

This function is based on pioneering work by Jesper Lindhardsen.

Examples

# We first set a working directory in which we have read and write permission
# These functions will produce temporary files which, if save.tmp is not set to TRUE, will
# be removed afterwards.

## Not run: 
setwd("v:/Data/Workdata/704791/AndersMunch/readSAS/R")

# Before importing a SAS data file it is useful to look into the
# format of the variables:
contentSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl")
contentSAS(filename="x:/Data/Rawdata_Hurtig/704791/pop.sas7bdat")

# Also, it is often a good idea to initially only read a limited amount of data
df101 <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",obs=101,save.tmp=TRUE)
# and to examine the result
str(df101)
df101

# Format, dates, numeric, character
df101 <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",obs=101,
                   save.tmp=TRUE,date.vars="inddto",
                   numeric.vars="pnr",character.vars="packsize")

# we can also use the pre.hook to limit the number of observations via sas options:
importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
          pre.hook="options obs=17;",where="diag='DN899'",keep=c("PNR","diag"),show.sas.code=1L)

# NOTE: In combination with a where statement SAS will find the first 101 observations that
#       satisfy the where statement. When the where statement finds nothing then SAS
#       will run through the whole file without finding anything

# To import the whole file just remove the limitations:
# df <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",show.sas.code=TRUE)
# But: this is not useful when files are huge and you should to look
# at the examples below which show how to read selected columns and rows only

# In huge data it is efficient to select only the required columns
# and to read only the rows of the data set that meet a criterion.
# The following example shows how to select columns with the keep option
# and how to specify the "where statement" of the SAS data step:
df0 <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
                 obs=101,
                 keep=c("diag","diagtype"),
                 where="diagtype = 'A' and diag = '45490'")

# Examine the imported data.table
str(df0)
df0
# using where contains is powerful
importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
                 obs=10,
                 save.tmp=TRUE,
                 where="diag contains 'DI2'",
                 keep=c("PNR","diag","inddto","uddto"))


# Another typical task is to import only the rows which correspond to
# a set of pnr numbers or a set of diagnoses or both.
# To achieve this the function importSAS is merging files during the import.
# The feature is called 'filter' and illustrated in the following examples:
# Example 1: import only rows of diag_indl which correspond to one of the
#            first 23 pnr's in pop
pop <- importSAS(filename="x:/Data/Rawdata_Hurtig/704791/pop.sas7bdat",
                 obs=23,keep="PNR")
df1 <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
                 obs=8501,filter=pop)
str(df1)
df1

# Example 2: import only rows of diag_indl which correspond the
#            first pnr's in pop for which also diag is equal to DK409
pop2 <- importSAS(filename="x:/Data/Rawdata_Hurtig/704791/pop.sas7bdat",
                 obs=4223,keep="PNR")
pop2 <- pop2[,diag:="DZ508"]
df2 <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
                 obs=8501,keep=c("pnr","diag","inddto","uddto","pattype"),
                 where="pattype ne 3",filter=pop2)
str(df2)
df2

## Sometimes the sas data file cannot be read due to unknown formats
## here is how to solve this:
df2a <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
                 obs=101,
                 pre.hook="options nofmterr;")


# The hooks set.hook and step.hook can be used as follows:
df3 <- importSAS(filename="X:/Data/Rawdata_Hurtig/704791/diag_indl",
                 obs=101,
                 keep=c("PNR","diag","diagtype"),
                 set.hook="firstobs=5",
                 step.hook="by PNR; firstPNR=first.PNR;",
                 where="diag contains 'I'")
str(df3)
df3

# if you are more fluent in SAS than in R it may help to
# know how to communicate e.g., a keep statement, instead of
# using the keep argument:

df4 <-importSAS(filename="X:/Data/Rawdata_Hurtig/999999/Dream201801",
                set.hook="keep=pnr branch;",
                obs=1000)

# Because the "overwrite" argument is FALSE, running the above code again will abort the import
# to not overwrite the temporary files.
# Setting "overwrite=TRUE" will allow the function to overwrite the files.


## End(Not run)

tagteam/heaven documentation built on Oct. 24, 2024, 7:40 p.m.