library(knitr)
options(htmltools.dir.version = FALSE, cache=TRUE)
opts_chunk$set(comment = NA, prompt=TRUE)
#opts_chunk$set(dev.args=list(bg="transparent"), fig.width=15, fig.height=7)
source("kutheme.R")
library(dataMaid)
toyData <- as.data.frame(toyData)
.center[
knitr::include_graphics("pics/datacleaning.jpg")
]
"Data cleaning" is not the best term ... and it should not be unsupervised.
In an R script:
Use NA to mark that information is missing in this spot.
Two systems for selecting observations in data.frames in R:
By index (row number) or using a logical vector.
(tD <- head(toyData, 3))
Four equivalent ways to get the second row of tD:
tD[2, ]                        # indexing
tD[c(FALSE, TRUE, FALSE), ]    # manual logical vector
tD[tD$id == 2, ]               # informative logical vector
tD %>% filter(id == 2)         # using the tidyverse (needs library(dplyr))
tD[tD$id == 2, ] #informative logical vector
Use informative logical vectors as much as possible!
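To see why the informative logical vector is the safer choice, here is a minimal sketch using a small made-up data frame (the name `visits` and its values are hypothetical and not part of toyData). Once the rows have been reordered, the row number no longer points at the same observation, but the condition on `id` still does.

```r
# Hypothetical example data; not from the slides
visits <- data.frame(id = 1:3, change = c(0.5, -1.2, 2.1))

# Reorder the rows, e.g. by sorting on another variable
sorted <- visits[order(visits$change), ]

# Selecting by position now returns a different observation ...
by_index <- sorted[2, ]

# ... while selecting by condition still returns the observation with id 2
by_condition <- sorted[sorted$id == 2, ]
```

The same reasoning applies whenever rows are added, removed or merged between the time the code was written and the time it is rerun.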
tD
# Mark non-positive change as missing:
tD[tD$change <= 0, "change"] <- NA
ALWAYS use variable names.
# Readable, informative code:
tD[tD$change <= 0, "change"] <- NA
# Indexing by column number easily becomes
# a source of error by itself:
tD[tD$change <= 0, 4] <- NA
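A minimal sketch of how indexing by column number can go wrong, assuming a small made-up data frame (`d`, `site` and all values are hypothetical). When a column is added in front, the number that used to point at `change` points at a different variable, and R gives no warning.

```r
# Hypothetical example data; not from the slides
d <- data.frame(id = 1:3,
                pill = c("red", "blue", "red"),
                change = c(0.5, -1.2, 2.1))

# A new column is added in front, so "change" moves from position 3 to 4
d2 <- data.frame(site = "A", d)

# Indexing by name still hits the intended variable
by_name <- d2
by_name[by_name$change <= 0, "change"] <- NA

# Indexing by the old column number silently overwrites "pill" instead
by_number <- d2
by_number[by_number$change <= 0, 3] <- NA
```

In `by_number` the negative value of `change` is still there, and a valid value of `pill` has been lost.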
background-image: url(pics/structure.png)
background-size: 30%
background-position: right
You should now have a cleaned dataset that can form the basis for future analyses, with documentation of how we got there!
Produce a summary document for subsequent analyses.
.footnotesize[
makeCodebook(presidentData)
]
Add a label (similar to the labelled package) or extra information
.footnotesize[
pD <- presidentData
attr(pD$presidencyYears, "label") <- "Full years as president"
]
.footnotesize[
attr(pD$birthday, "shortDescription") <- "Dates are stored in YYYY-MM-DD format"
]
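The two attributes above are ordinary R attributes, so they can be set and inspected with base R alone. The sketch below uses a hypothetical stand-in for presidentData with made-up values, so that it runs without the dataMaid package.

```r
# Hypothetical stand-in for presidentData; the values are made up
pD <- data.frame(presidencyYears = c(4, 8),
                 birthday = as.Date(c("1950-01-01", "1960-06-15")))

attr(pD$presidencyYears, "label") <- "Full years as president"
attr(pD$birthday, "shortDescription") <- "Dates are stored in YYYY-MM-DD format"

# Read the attributes back to confirm they are attached to the columns
attr(pD$presidencyYears, "label")
attr(pD$birthday, "shortDescription")

# Collect all labels in one named vector; columns without a label give NA
labels <- sapply(pD, function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) NA_character_ else lab
})
```

Plain attributes like these can be dropped when a column is subset or transformed, so it is probably safest to add them as one of the last steps before creating the codebook.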
class: inverse
Correct the errors you have found so far.
Make sure to make the cleaning process reproducible.
Remember rules 1 and 2!
Create the final codebook with additional information about some of the variables.
makeCodebook(myCleanedData)
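One possible shape for a reproducible cleaning script is sketched below. Everything in it is an assumption made for illustration: the file names, the `age` variable and the plausibility limits of 0 and 120 are hypothetical, and a temporary file stands in for the raw data so the sketch can run on its own.

```r
# Hypothetical raw data written to a temporary file to keep the sketch runnable
raw_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:4, age = c(34, -1, 51, 999)),
          raw_file, row.names = FALSE)

# 1. Read the raw data; the raw file itself is never edited by hand
raw <- read.csv(raw_file)

# 2. Apply every correction in code, using variable names and an
#    informative logical condition, and mark impossible values as NA
cleaned <- raw
cleaned[which(cleaned$age < 0 | cleaned$age > 120), "age"] <- NA

# 3. Save the cleaned data under a new name; the raw file stays untouched
clean_file <- tempfile(fileext = ".csv")
write.csv(cleaned, clean_file, row.names = FALSE)
```

Rerunning the script from the top reproduces the cleaned data exactly, and the script doubles as documentation of each correction.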
knitr::include_graphics("pics/colrow1.png")
knitr::include_graphics("pics/colrow2.png")
knitr::include_graphics("pics/colrow3.png")
dataMaid performs class-dependent checks for each variable in a dataset, one at a time (column-wise). An R package that performs row-wise checks: validate. Check out the talk on Wednesday @ 14.50 by Edwin de Jonge:
validatetools - resolve and simplify contradictive or redundant data validation rules
Note: "validation" is used differently here. It is no longer about format, type and range, but is used as a synonym for "check".
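A minimal sketch of a row-wise check with the validate package, assuming the package is installed. The data frame and its variable names (`admitted`, `discharged`) are hypothetical; the rule compares two variables within the same row, which is the kind of check a column-wise tool does not cover.

```r
library(validate)

# Hypothetical example data; not from the slides
d <- data.frame(
  id = 1:3,
  admitted   = as.Date(c("2018-01-05", "2018-02-10", "2018-03-01")),
  discharged = as.Date(c("2018-01-09", "2018-02-01", "2018-03-04"))
)

# A rule that compares two variables within the same row
rules <- validator(discharged >= admitted)

# Confront the data with the rule and inspect the outcome
result <- confront(d, rules)
summary(result)

# One logical value per row; FALSE marks a row that breaks the rule
passed <- values(result)
```

Here the second row is flagged because the discharge date lies before the admission date.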
class: middle, center
Please grab hold of us here or via email
.pull-left[Anne
ahpe@sund.ku.dk]
.pull-right[Claus
ekstrom@sund.ku.dk]