library(knitr)
options(htmltools.dir.version = FALSE)
opts_chunk$set(comment = NA, prompt = TRUE, cache = TRUE)
#opts_chunk$set(dev.args=list(bg="transparent"), fig.width=15, fig.height=7)
source("kutheme.R")
library(dataMaid)
library(dplyr) # provides %>% and filter(), used in the row-selection examples
toyData <- as.data.frame(toyData)

Summarizing the errors


Data cleaning

knitr::include_graphics("pics/datacleaning.jpg")

.pull-right[Not the best term ... and should not be unsupervised]


Data cleaning in R

In an R-script:

  1. Make a copy of the dataset.
  2. Use indexing to locate the problem in the data.
  3. Overwrite the faulty value with a correct one - if you know it - or NA to mark that information is missing in this spot.
  4. Save the copy of the "cleaned" data in a new file.
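
The four steps above can be sketched in base R as follows. The data frame `raw`, its columns, the replacement value 25, and the output file name are all made up for illustration; they are not part of the course data.

```r
# Hypothetical raw data; names and values are invented for this sketch
raw <- data.frame(id = 1:4, age = c(34, 250, 41, -1))

# 1. Make a copy so the original data stay untouched
cleaned <- raw

# 2. Use (logical) indexing to locate the problems
tooOld   <- cleaned$age > 120
negative <- cleaned$age < 0

# 3. Overwrite: a correct value where we know it, NA otherwise
cleaned[tooOld, "age"]   <- 25  # assumed to be confirmed in the source records
cleaned[negative, "age"] <- NA  # true value unknown

# 4. Save the cleaned copy in a new file
outFile <- file.path(tempdir(), "cleaned.csv")
write.csv(cleaned, outFile, row.names = FALSE)
```

Because every correction lives in the script, rerunning it on the untouched raw data reproduces the cleaned file.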

Selection - rows/observations

R has two systems for selecting observations in a data.frame: by index (row number) or with a logical vector.

(tD <- head(toyData, 3))  

Selection - rows/observations

Four equivalent ways to get the second row of tD:

tD[2, ] #indexing
tD[c(FALSE, TRUE, FALSE), ] #manual logical vector 
tD[tD$id == 2, ] #informative logical vector
tD %>% filter(id == 2) #tidyverse (needs library(dplyr))

Selection - rows/observations

Use informative logical vectors as much as possible!

tD

#Mark non-positive change as missing:
tD[tD$change <= 0, "change"] <- NA
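
One reason to prefer informative logical vectors is that they keep pointing at the right rows when the row order changes. The sketch below uses a made-up data frame `d` (not toyData) to show this, and also shows one way to handle missing values in the condition: a comparison with NA gives NA, and NA is not allowed in the row index of a data frame assignment, so `which()` is used to keep only the TRUE positions.

```r
# Hypothetical data, not toyData
d <- data.frame(id = c(1, 2, 3), change = c(-0.5, 0.8, NA))

# Same data in a different row order, e.g. after sorting or merging
shuffled <- d[c(3, 1, 2), ]

byPosition  <- shuffled[2, ]                 # now picks id 1, not id 2
byCondition <- shuffled[shuffled$id == 2, ]  # still picks id 2

# which() drops the NA produced by comparing the missing value
nonPositive <- which(d$change <= 0)
d[nonPositive, "change"] <- NA
```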

Selection - columns/variables

ALWAYS use variable names.

#readable, informative code:
tD[tD$change <= 0, "change"] <- NA

# Indexing by numbers easily becomes 
# a source of error by itself:
tD[tD$change <= 0, 4] <- NA
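
A small sketch of how a column number can silently go wrong: once a column is added in front, position 4 refers to a different variable, while the name still refers to the intended one. The data frame `d` and the added `site` column are invented for this illustration.

```r
# Hypothetical data where "change" happens to be column 4
d <- data.frame(id = 1:3,
                pill = c("red", "blue", "red"),
                events = c(1, 4, 2),
                change = c(-0.5, 0.8, 0.1))

# Someone later adds a column at the front
d2 <- cbind(site = "A", d)

colAtFour <- names(d2)[4]  # no longer "change"

# Selecting by name still hits the intended variable
d2[d2$change <= 0, "change"] <- NA
```

With `d2[d2$change <= 0, 4] <- NA` the same line would have overwritten `events` instead, without any warning.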

class: inverse

Exercise 4

Correct the errors you have found so far.

Make sure to make the cleaning process reproducible.

Remember rules 1 and 2!


background-image: url(pics/structure.png)
background-size: 30%

Finishing up

We should now have a cleaned dataset that can form the basis for future analyses.

With documentation of how we got there!


Create codebook

Produce a summary document for subsequent analyses.

.footnotesize[

makeCodebook(bigPresidentData)

]

Add a label (similar to the labelled package) or extra information

.footnotesize[

bPD <- bigPresidentData
attr(bPD$presidencyYears, "label") <- 
  "Full years as president"

]

.footnotesize[

attr(bPD$dateOfDeath, "shortDescription") <- 
  "Missing means that the person is still alive"

]
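
Attributes set this way are stored on the column itself, so they can be read back with `attr()`. The sketch below uses a small made-up data frame `d` (a stand-in, not bigPresidentData) and collects the labels across variables; the helper logic is only an illustration, not part of dataMaid.

```r
# Hypothetical stand-in for a cleaned dataset
d <- data.frame(years = c(8, 4),
                died = as.Date(c("1799-12-14", NA)))

attr(d$years, "label") <- "Full years as president"
attr(d$died, "shortDescription") <-
  "Missing means that the person is still alive"

# Collect the label of every variable; variables without one get NA
labels <- vapply(d, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))
```

A quick overview like `labels` can help spot variables that still lack a description before the codebook is built.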


class: inverse

Exercise 4b

Create the final codebook with additional information about some of the variables.

makeCodebook(myCleanedData)

class: middle, center

Thank you!

Please grab hold of us here or via email

.pull-left[Anne
ahpe@sund.ku.dk]

.pull-right[Claus
ekstrom@sund.ku.dk]



ekstroem/dataMaid documentation built on Jan. 31, 2022, 9:10 a.m.