This document explains how to import data from SAS. The example is meant for preprocessing on Statistics Denmark servers. Using the package RODBC it is possible to take advantage of the SAS engine to generate smaller data sets to use in R.

Open connection to SAS

The package RODBC implements ODBC (open database connectivity). ODBC provides a common API (application programming interface) to access databases, so that you can access data originating from a variety of software applications.

library(RODBC)

The command odbcConnect opens a connection through a prespecified DSN (Data Source Name) making it possible to access data on a remote server. The data libraries on Statistics Denmarks are accessed using this DSN. The libnames for all projects on Statistics Denmark are predefined in a SAS macro and can only be changed by someone in Statistics Denmark.

The command below opens a connection to SAS called "cnn", and a SAS session is initialized in the background. When the SAS session is initialized, a dialog box opens in SAS. This window needs to be closed for the connection to initialize.

SAS should not be startet before, since this may cause the connection to fail.

cnn <- odbcConnect("SAS64", believeNRows=FALSE)

When the connection is initialized, it is possible to submit queries to the database using SQL (Structured Query Language).

The command sqlTables returns all accesible tables from the connection. Note the restriction of "raw3766", since only data from the specific projet is accesible.

sqlTables(cnn, schema="raw3766")

To reduce the data set, either restrict the columns or return only rows meeting some criteria.

The command sqlColumns returns available colunms in a table, eg. which columns are availble in the table called pop in raw3766.

sqlColumns(cnn, "raw3766.pop", as.is=TRUE)

The option as.is=TRUE lets characters be returned as character vectors and not converted.

Returning specific variables using SQL

Below are examples on how to reduce the data set by returning part of tables using SQL.

The expressions are made using the command sqlQuery, in which you can put a SELECT statement. These statements are on the form "SELECT column_name,column_name,... FROM table_name" where colunm_name is the variable to return. If all variables in the table is needed, use "SELECT * " (select all).

The query below returns all variables in the table pop.

pop <- sqlQuery(cnn, "SELECT * FROM raw3766.pop", as.is=TRUE)

To return records meeting specific criteria, use the "WHERE" clause as seen below. The command returns all variables in the table pop for records born after 1 january 2010.

pop <- sqlQuery(cnn, "SELECT * FROM raw3766.pop WHERE fdato > '01jan2010'd", as.is=TRUE)

The command below returns all variables in the table pop for records born after 1 january 2010 with sex = 1.

pop <- sqlQuery(cnn, "SELECT * FROM raw3766.pop WHERE fdato > '01jan2010'd AND sex = 1", as.is=TRUE)

The following commands use the table "diag_indl" as an example.

To display which columns are accesible, sqlColumns is used.

sqlColumns(cnn, "diag_indl", as.is=TRUE)

Returning all variables in the table diag_indl for records with the diagnosis codes I21, I22 and 410 for heart attacks:

heartattacks <- sqlQuery(cnn, "SELECT * FROM raw3766.diag_indl WHERE substr(diag, 1, 3) in ('I21', 'I22', '410')", as.is=TRUE)

Return only diagnosis code and pnr from diag_indl for records with pattype = 0:

admission <- sqlQuery(cnn, "SELECT diag,pnr FROM raw3766.diag_indl WHERE pattype = '0'", as.is=TRUE)

Close the ODBC connection

To close the connection use the function close.

close(cnn)


tagteam/heaven documentation built on March 24, 2024, 7:58 a.m.