cidacsdict

This R package provides utilities to read, transform, translate, and clean data using CIDACS dictionaries, including data obtained directly from DATASUS. It includes methods to translate a pre-defined, exhaustive dictionary from Portuguese to English and to recode categorical variables so that they have meaningful factor levels.

Installation

# install.packages("devtools")
devtools::install_github("ki-tools/cidacsdict")

Usage

A data dictionary provided by CIDACS is available in this package as a dataset, cdcs_dict.

cdcs_dict
#> # A tibble: 209 x 17
#>       id db_name db_name_en name  name_en label label_en label_google_en
#>    <dbl> <chr>   <chr>      <chr> <chr>   <chr> <chr>    <chr>          
#>  1     1 Cadast… Single Re… cod_… househ… Códi… Househo… Code of the pl…
#>  2     2 Cadast… Single Re… cod_… househ… Códi… Househo… Code of the fa…
#>  3     3 Cadast… Single Re… cod_… househ… Códi… Househo… Household mate…
#>  4     4 Cadast… Single Re… cod_… househ… Códi… Househo… Code of househ…
#>  5     5 Cadast… Single Re… cod_… househ… Códi… Househo… Household sani…
#>  6     6 Cadast… Single Re… cod_… househ… Códi… Househo… Code of househ…
#>  7     7 Cadast… Single Re… cod_… househ… Códi… Househo… Code of illumi…
#>  8     8 Cadast… Single Re… cod_… sex     Códi… Code in… Code that indi…
#>  9     9 Cadast… Single Re… cod_… kinship Códi… Code fo… Code of kinshi…
#> 10    10 Cadast… Single Re… cod_… race    Códi… Person’… Code of the pe…
#> # ... with 199 more rows, and 9 more variables: map <list>, map_en <list>,
#> #   type <chr>, presence <chr>, presence_en <chr>, comments_en <chr>,
#> #   db <chr>, map_en_orig <chr>, map_orig <chr>

The dictionary provides variable names, labels, types, and mappings in both English and Portuguese.

A mapping translates the integer codes of a categorical variable into more meaningful factor levels. For example, the first record in the dictionary describes the household location code, and its English mapping is:

cdcs_dict$map_en[[1]]
#> $`1`
#> [1] "Urban"
#> 
#> $`2`
#> [1] "Rural"
#> 
#> $`0`
#> [1] "Not informed"
#> 
#> $`99`
#> [1] "Omitted"
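
As a rough illustration of how such a mapping can be applied, the sketch below recodes a vector of raw codes into a factor using base R only. The mapping is copied from the output above; the codes vector is made up for illustration, and this is not necessarily how the package applies mappings internally.

```r
# Mapping copied from cdcs_dict$map_en[[1]] as printed above
map <- list(`1` = "Urban", `2` = "Rural", `0` = "Not informed", `99` = "Omitted")

# Hypothetical raw codes, stored as character like the raw DATASUS columns
codes <- c("1", "2", "2", "0", "99", "7")

# Look up each code by name; codes missing from the mapping become NA
labels <- unname(unlist(map)[codes])

# Keep the level order defined by the mapping
recoded <- factor(labels, levels = unlist(map, use.names = FALSE))
print(recoded)
```

Codes that have no entry in the mapping (here "7") end up as NA rather than as an extra level, which makes unexpected values easy to spot.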

Reading Data

The utility function read_datasus() reads data files that are publicly available from DATASUS:

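# SINASC file on the public DATASUS FTP server
fin <- "ftp://ftp.datasus.gov.br/dissemin/publicos/SINASC/NOV/DNRES/DNRO2013.DBC"
fout <- tempfile(fileext = ".DBC")
# mode = "wb" keeps the binary .DBC file intact on Windows
download.file(fin, destfile = fout, mode = "wb")
d <- read_datasus(fout)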

Note that this requires an FTP connection from within Brazil.
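
Because the FTP server may not be reachable, it can help to guard the download. The following is a minimal sketch using only base R; safe_download() is a hypothetical helper and is not part of cidacsdict.

```r
# Hypothetical helper: returns TRUE if the download succeeded, FALSE otherwise
safe_download <- function(url, dest) {
  tryCatch({
    # download.file() returns 0 (invisibly) on success
    status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
    status == 0
  },
  error = function(e) FALSE,
  warning = function(w) FALSE)
}

# Intended usage (left commented out because it needs network access and cidacsdict):
# fin <- "ftp://ftp.datasus.gov.br/dissemin/publicos/SINASC/NOV/DNRES/DNRO2013.DBC"
# fout <- tempfile(fileext = ".DBC")
# if (safe_download(fin, fout)) d <- read_datasus(fout)
```

Treating warnings as failures is a deliberately conservative choice here, since a partially downloaded file would not be usable.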

Transforming and Cleaning Data

Two synthetic datasets, sinasc_example and sim_example, are included with the package to illustrate how transforming and cleaning data works. They resemble data from the DATASUS SINASC and SIM databases.

The function transform_data() takes a dataset and applies a given dictionary to it, converting the variables to the appropriate types and mapping categorical variables.
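
To give a sense of the type conversion involved, here is a small base R sketch. It assumes, based on the values in the example data, that date fields such as DTNASC are stored as ddmmyyyy strings and that counts and weights are digit strings. It does not reproduce the internals of transform_data().

```r
# Raw values as they appear in sinasc_example (everything is character)
dtnasc     <- c("20062015", "07012015", "13022015", "18042015")
peso       <- c("3455", "2850", "2970", "2925")
qtdfilmort <- c("00", "00", "00", NA)

# Assumption: dates are day, month, year with no separators
birth_date <- as.Date(dtnasc, format = "%d%m%Y")

# Zero-padded digit strings convert cleanly; NA stays NA
brthwt_g     <- as.integer(peso)
n_dead_child <- as.integer(qtdfilmort)

str(list(birth_date = birth_date, brthwt_g = brthwt_g, n_dead_child = n_dead_child))
```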

The function clean_data() takes the transformed data and performs additional cleaning steps based on experience with this type of data. Cleaning includes fixing implausible values and adding state and region codes based on the provided municipality codes, using the brazilgeo package.
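
One simple way to handle implausible values is to set anything outside a plausible range to missing. The sketch below shows that idea with a hypothetical helper, fix_implausible(); the rules that clean_data() actually applies are not documented here. The bounds of 0 and 10 are used only because that is the defined range of an Apgar score.

```r
# Hypothetical helper: convert to integer and blank out values outside [lower, upper]
fix_implausible <- function(x, lower, upper) {
  x <- as.integer(x)
  out_of_range <- !is.na(x) & (x < lower | x > upper)
  x[out_of_range] <- NA_integer_
  x
}

# Made-up Apgar scores, including an out-of-range value ("99") and a missing one
apgar1 <- c("09", "08", "99", NA, "10")
print(fix_implausible(apgar1, lower = 0, upper = 10))
```

Replacing such values with NA keeps the row in the dataset while making it clear that the recorded value was not trusted.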

# look at original data
str(sinasc_example)
#> 'data.frame':    100 obs. of  63 variables:
#>  $ NUMERODN  : chr  "69098242" "65909036" "65865119" "65912343" ...
#>  $ CODINST   : chr  "MAM1302600001" "MPR4101800001" "MMG3106200001" "MPA1505430001" ...
#>  $ ORIGEM    : chr  "1" "1" "1" "1" ...
#>  $ NUMERODV  : chr  "3" "3" "5" "2" ...
#>  $ PREFIXODN : chr  "30" "30" "30" "30" ...
#>  $ CODESTAB  : chr  "2654024" "2077701" "2081644" "2020068" ...
#>  $ CODMUNNASC: chr  "310620" "530010" "530010" "410640" ...
#>  $ LOCNASC   : chr  "1" "1" "1" "1" ...
#>  $ IDADEMAE  : chr  "32" "34" "39" "16" ...
#>  $ ESTCIVMAE : chr  "2" "5" "2" "5" ...
#>  $ ESCMAE    : chr  "4" "2" "4" "3" ...
#>  $ CODOCUPMAE: chr  "999992" "999992" "999991" "354705" ...
#>  $ QTDFILVIVO: chr  "01" "01" "00" "00" ...
#>  $ QTDFILMORT: chr  "00" "00" "00" NA ...
#>  $ CODMUNRES : chr  "240810" "330455" "354870" "355030" ...
#>  $ GESTACAO  : chr  "5" "4" "5" "5" ...
#>  $ GRAVIDEZ  : chr  "1" "1" "1" "1" ...
#>  $ PARTO     : chr  "2" "1" "2" "1" ...
#>  $ CONSULTAS : chr  "4" "2" "4" "4" ...
#>  $ DTNASC    : chr  "20062015" "07012015" "13022015" "18042015" ...
#>  $ HORANASC  : chr  "1955" "1725" "1950" "0115" ...
#>  $ SEXO      : chr  "2" "2" "2" "2" ...
#>  $ APGAR1    : chr  "09" "08" "08" "07" ...
#>  $ APGAR5    : chr  "10" "09" "09" "10" ...
#>  $ RACACOR   : chr  "4" "4" "1" "4" ...
#>  $ PESO      : chr  "3455" "2850" "2970" "2925" ...
#>  $ IDANOMAL  : chr  "2" "2" "2" "2" ...
#>  $ DTCADASTRO: chr  "06112015" "15052015" "18052015" "15072015" ...
#>  $ CODANOMAL : chr  NA NA NA NA ...
#>  $ NUMEROLOTE: chr  "20150021" "20150017" "20150008" "20150012" ...
#>  $ VERSAOSIST: chr  "3.2.01" "3.2.01" "3.2.01" "3.2.01" ...
#>  $ DTRECEBIM : chr  "11022016" "10122015" "13102015" "24082016" ...
#>  $ DIFDATA   : chr  "024" "424" "009" "050" ...
#>  $ DTRECORIG : chr  "21092015" "16102015" "12022015" "05082015" ...
#>  $ NATURALMAE: chr  "831" "823" "835" "852" ...
#>  $ CODMUNNATU: chr  "354780" "313670" "261250" "260010" ...
#>  $ CODUFNATU : chr  "31" "42" "31" "42" ...
#>  $ ESCMAE2010: chr  "3" "5" "3" "3" ...
#>  $ SERIESCMAE: chr  NA NA NA NA ...
#>  $ DTNASCMAE : chr  "20121999" "23101984" "01021984" "05031978" ...
#>  $ RACACORMAE: chr  "4" "1" "4" "4" ...
#>  $ QTDGESTANT: chr  "00" NA "01" "02" ...
#>  $ QTDPARTNOR: chr  "01" "00" "00" NA ...
#>  $ QTDPARTCES: chr  "00" "00" "00" "00" ...
#>  $ IDADEPAI  : chr  NA NA "34" NA ...
#>  $ DTULTMENST: chr  NA "08112014" "20062014" NA ...
#>  $ SEMAGESTAC: chr  "35" "38" "37" "40" ...
#>  $ TPMETESTIM: chr  "8" "8" "8" "8" ...
#>  $ CONSPRENAT: chr  "07" "07" "06" "06" ...
#>  $ MESPRENAT : chr  "02" "03" "03" "05" ...
#>  $ TPAPRESENT: chr  "1" "2" "1" "1" ...
#>  $ STTRABPART: chr  "2" "1" "2" "2" ...
#>  $ STCESPARTO: chr  "1" NA NA NA ...
#>  $ TPNASCASSI: chr  "2" "1" "1" "1" ...
#>  $ TPFUNCRESP: chr  "2" "2" "2" "5" ...
#>  $ TPDOCRESP : chr  "4" "4" "0" "0" ...
#>  $ DTDECLARAC: chr  "23032015" "17012015" "02022015" "23012015" ...
#>  $ ESCMAEAGR1: chr  "04" "08" "12" "05" ...
#>  $ TPROBSON  : chr  "03" "06" "02" "04" ...
#>  $ STDNEPIDEM: chr  "0" "0" "0" "0" ...
#>  $ STDNNOVA  : chr  "1" "1" "1" "1" ...
#>  $ CODPAISRES: chr  "1" "1" "1" "1" ...
#>  $ PARIDADE  : chr  "1" "1" "1" "0" ...

d <- transform_data(sinasc_example, subset(cdcs_dict, db == "SINASC"))
#> There are 20 variables in the data that either aren't in the dictionary or aren't in 'keep':
#>   numerodn
#>   codinst
#>   origem
#>   numerodv
#>   prefixodn
#>   codestab
#>   dtcadastro
#>   numerolote
#>   versaosist
#>   dtrecebim
#>   difdata
#>   dtrecorig
#>   codmunnatu
#>   tpfuncresp
#>   tpdocresp
#>   dtdeclarac
#>   tprobson
#>   stdnepidem
#>   stdnnova
#>   paridade
#> There are 5 variables in the dictionary or 'keep' that aren't in the data:
#>   cepnasc: Zip code of the birth place
#>   codbainasc: Borough code of the birth place
#>   codestocor: Federation Unit Code of the birth
#>   cepres: Zip code of the residence
#>   codbaires: Borough of residence code
#> Mapping sexo...
#> Mapping idanomal...
#> Mapping racacor...
#> Mapping locnasc...
#> Mapping estcivmae...
#> Mapping escmae...
#> Mapping escmae2010...
#> Mapping racacormae...
#> Mapping escmaeagr1...
#> Mapping gestacao...
#> Mapping gravidez...
#> Mapping parto...
#> Mapping consultas...
#> Mapping tpmetestim...
#> Mapping tpapresent...
#> Mapping sttrabpart...
#> Mapping stcesparto...
#> Mapping tpnascassi...
d <- clean_data(d)
#> Fixing implausible values of apgar1...
#> Fixing implausible values of apgar5...
#> Fixing implausible values of n_live_child...
#> Fixing implausible values of n_dead_child...
#> Adding birth year...
#> Fixing implausible values of birth weight...
#> Fixing implausible values of mother's age...
#> Fixing muni codes...
#> Adding state, micro, meso codes for birth_muni_code...
#> Adding state, micro, meso codes for m_muni_code...

# look at the result
str(d)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  50 variables:
#>  $ m_muni_code         : int  120060 130260 130260 130260 130260 140010 150020 150178 150442 150460 ...
#>  $ birth_muni_code     : int  150360 317020 240810 510340 412800 292740 150140 352850 530010 530010 ...
#>  $ birth_place         : Factor w/ 6 levels "Hospital","Other health facilities",..: 1 1 1 1 3 1 1 1 1 1 ...
#>  $ m_age_yrs           : int  23 33 25 23 41 32 19 30 34 21 ...
#>  $ marital_status      : Factor w/ 8 levels "Single","Married",..: 1 1 2 1 1 1 1 1 4 2 ...
#>  $ m_educ              : Factor w/ 8 levels "None","1 to 3 years",..: 4 4 5 4 5 5 5 5 3 4 ...
#>  $ occ_code            : int  421105 999992 999992 NA 999992 411010 999992 515105 NA 999992 ...
#>  $ n_live_child        : int  2 0 0 0 0 NA 2 1 2 1 ...
#>  $ n_dead_child        : int  0 0 0 1 0 1 0 0 0 0 ...
#>  $ gest_weeks_cat      : Factor w/ 9 levels "Less than 22 weeks",..: 4 5 5 5 5 5 5 5 4 5 ...
#>  $ preg_type           : Factor w/ 6 levels "Singleton","Twins",..: 1 2 1 1 1 1 1 1 1 1 ...
#>  $ deliv_type          : Factor w/ 5 levels "Vaginal","Cesarean",..: 2 1 1 2 1 1 2 2 2 2 ...
#>  $ n_prenat_visit_cat  : Factor w/ 7 levels "None","from 1 to 3",..: 4 4 4 4 4 3 4 3 4 3 ...
#>  $ birth_date          : Date, format: "2015-05-26" "2015-09-02" ...
#>  $ birth_time          : chr  "0431" "0137" "0724" "2134" ...
#>  $ sex                 : Factor w/ 5 levels "Male","Female",..: 2 1 2 1 1 1 1 2 2 1 ...
#>  $ apgar1              : int  10 9 8 NA 8 9 7 8 8 8 ...
#>  $ apgar5              : int  10 10 NA 10 9 10 9 9 8 9 ...
#>  $ race                : Factor w/ 7 levels "White","Black",..: 1 1 4 4 1 4 1 4 4 4 ...
#>  $ brthwt_g            : int  3595 3830 2660 3450 3390 4140 2312 3170 2665 3595 ...
#>  $ cong_anom           : Factor w/ 5 levels "Yes","No","Ignored",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ cong_icd10          : chr  NA NA NA NA ...
#>  $ m_birth_country_code: int  831 852 833 826 852 832 815 831 851 823 ...
#>  $ m_fu_code           : int  NA 35 23 35 26 35 23 33 NA 11 ...
#>  $ m_educ_2010         : Factor w/ 9 levels "No schooling",..: 3 2 2 2 2 2 5 3 3 3 ...
#>  $ m_educ_grade        : int  8 NA 6 3 2 3 NA 3 3 3 ...
#>  $ m_birth_date        : Date, format: "1972-07-06" "1989-03-27" ...
#>  $ m_race              : Factor w/ 6 levels "White","Black",..: NA 1 4 1 1 4 4 4 4 2 ...
#>  $ n_prev_preg         : int  3 1 0 0 0 0 5 0 2 0 ...
#>  $ n_vag_deliv         : int  0 2 NA 4 0 0 0 0 0 3 ...
#>  $ n_ces_deliv         : int  0 0 1 0 0 0 2 NA 0 0 ...
#>  $ f_age_yrs           : int  33 23 NA NA 18 32 NA 25 NA 32 ...
#>  $ menstrual_date_last : Date, format: "2014-06-21" NA ...
#>  $ gest_weeks          : int  38 39 39 39 40 40 40 39 37 38 ...
#>  $ gest_method         : Factor w/ 5 levels "Physical exam",..: 2 NA NA 1 1 2 NA NA NA 2 ...
#>  $ n_prenat_visit      : int  10 9 11 7 9 6 4 8 9 10 ...
#>  $ gest_month_precare  : int  1 2 3 5 2 NA 2 2 7 NA ...
#>  $ presentation        : Factor w/ 6 levels "Cephalic","Pelvic or breech",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ labor_induced       : Factor w/ 6 levels "Yes","No","Not applicable",..: 2 1 1 2 2 2 2 2 2 1 ...
#>  $ ces_pre_labor       : Factor w/ 6 levels "Yes","No","Not applicable",..: 1 1 NA 2 NA 2 NA NA 1 NA ...
#>  $ birth_assist        : Factor w/ 7 levels "Doctor","Nurse / midwife",..: 1 1 1 1 1 NA 1 1 1 1 ...
#>  $ m_educ_2010agg      : Factor w/ 14 levels "No Schooling 2 - Fundamental I Incomplete",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ m_country_code      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ birth_year          : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
#>  $ birth_state_code    : chr  "PA" "MG" "RN" "MT" ...
#>  $ birth_micro_code    : chr  "150016" "310059" "240001" "510001" ...
#>  $ birth_meso_code     : chr  "1505" "3111" "2401" "5101" ...
#>  $ m_state_code        : chr  "AC" "AM" "AM" "AM" ...
#>  $ m_micro_code        : chr  "120005" "130001" "130001" "130001" ...
#>  $ m_meso_code         : chr  "1202" "1301" "1301" "1301" ...
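
The state codes in the output above are consistent with the convention that the first two digits of an IBGE municipality code identify the state. The package relies on brazilgeo for this step; the sketch below only illustrates the idea in base R, using a deliberately partial lookup table covering the states that appear in the output.

```r
# Partial lookup from two-digit IBGE state code to state abbreviation
# (only a few states are listed; a complete table would have 27 entries)
uf_lookup <- c(`12` = "AC", `13` = "AM", `15` = "PA",
               `24` = "RN", `31` = "MG", `51` = "MT")

# First four values of birth_muni_code from the output above, plus an invalid code
muni_code <- c(150360L, 317020L, 240810L, 510340L, 999999L)

# The state prefix is the first two digits of the municipality code
state_prefix <- substr(as.character(muni_code), 1, 2)

# Prefixes not in the lookup yield NA
state_code <- unname(uf_lookup[state_prefix])
print(state_code)
```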


ki-tools/cidacsdict documentation built on May 12, 2019, 10:51 a.m.