```r
les <- 3
load("course_urls.RData")
knitr::opts_chunk$set(echo = TRUE, class.source = "Rchunk", class.output = "Rout")
```
Picture by Hamid Hajihusseini, CC BY 3.0, via Wikimedia Commons
```r
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  fig.width = 6,
  fig.height = 4
)
image_dir <- here::here("images")
```
```r
## packages
library(tidyverse)
library(readxl)
```
After this lesson you will be familiar with agile working, and with the most important data management guidelines and principles.
Do you recognize this?
knitr::include_graphics( file.path( image_dir, "final_final.png" ) )
Source: https://medium.com/@jameshoareid/final-pdf-finalfinal-pdf-actualfinal-pdf-cae61ab1d94c
And this?
knitr::include_graphics( file.path( image_dir, "final.jpg" ) )
The use of version control eliminates the need to invent a new file name every time you save a file. You will learn more about using version control (git and github.com) in lessons 2 and 3. With git version control you only have to think about naming a file once, with a good name. But what makes a 'good' file name?
A good file name uses underscores (`_`) to separate words and a dot (`.`) only before the extension. Having multiple dots (`.`) in a file name can be confusing but sometimes is required; for example, for an archive we sometimes see `<file_name>.tar.gz`. Some people use UpperFirst or camelCase instead of underscores. The special characters you should avoid in a file name:
! @ # $ % ^ & * ( ) + - = : " ' ; ? > < ~ / { } [ ] \ | ` ,
Special characters are reserved for other purposes and can cause problems when files are backed up, copied, or loaded into analysis software.
Basically, what was stated about file names also applies to naming variables in a dataset. In other words: it applies to choosing or creating valid names for columns in a data frame, or names of R objects for that matter.
knitr::include_graphics( here::here( "images", "bad-characters.png" ) )
Below, I show an example of a badly formatted file name and badly formatted column names to make the point.
knitr::include_graphics( file.path(image_dir, "bad_formatting_file_name_and_headers.jpg") )
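If you do end up with messy column names like these, they do not have to stay that way. Below is a small sketch (the tibble is made up for illustration) of how the `{janitor}` package can clean them up; we will use `janitor::clean_names()` again later in this lesson.

```r
library(tidyverse)
library(janitor)

# a made-up tibble with badly formatted column names
messy <- tibble(
  `Measured 1 (pg/ml)` = c(0.1, 0.4),
  `% viable cells!`    = c(88, 92)
)

# clean_names() converts the names to lower case snake_case
messy %>% clean_names() %>% names()
```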
To help you build a thorough data management process for yourself, one that you can start using and expand when needed, we need a framework. In this course we use the Guerrilla Analytics framework, as described by Enda Ridge in this booklet. If you are pursuing a career in Data Science, I highly recommend getting a copy! The booklet describes, in a very practical and hands-on way, how to establish a data management workflow. Whether you work all by yourself or in a team, the pointers in this book will be applicable. Once you get to know this framework, you will discover that you used to do it wrong (like me...).
To build the framework, let's look at the 7 core Guerrilla Analytics principles:
- Space is cheap, confusion is expensive
- Use simple, visual project structures and conventions
- Automate (everything - my addition) with program code
- Link stored data to data in the analytics environment to data in work products (literate programming with RMarkdown - my addition)
- Version control changes to data and analytics code (Git/Github.com)
- Consolidate team knowledge (agree on guidelines and stick to it as a team)
- Use code that runs from start to finish (literate programming with RMarkdown - my addition)
[Guerrilla Analytics book by Enda Ridge](https://guerrilla-analytics.net/)
knitr::include_graphics( here::here("images", "guerrilla_analytics.jpg") )
As you can see from my own edits to these principles, quite a few are immediately applicable when using a programming language like R and the Rmd format, which we use all the time. I will go over each principle below in more detail. But first, an exercise.
Now that we have encountered a data management challenge in the wild, let's build our framework so that we can tackle these types of problems in a more structured fashion the next time we meet them.
This principle is simple: storage costs are low these days, so there is no need to spend a lot of time on file administration.
knitr::include_graphics( file.path( image_dir, "docking_station_harddisk.jpg" ) )
When you receive files from a laboratory that has performed a sequencing analysis, they are usually in `.fastq.gz` or `.fasta.gz` format. Because these files can be big, they are usually accompanied by a small file containing a hash-like string looking something like this:
```r
tools::md5sum(here::here("data", "tidy_Kopie van salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"))
```
This is the md5sum checksum for the file used earlier: `./data/salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx`. When the file changes, the checksum also changes, as we will see in the following exercise.
There are a number of different algorithms with which we can calculate such so-called checksums. Here we use md5sums, produced by MD5, a popular hashing algorithm. md5sums are, in effect, fingerprints of a file's contents.
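In practice you compare the md5sum you compute locally with the one that was shipped with the file. A minimal sketch (the path and the expected value below are hypothetical):

```r
# the checksum you received alongside the data file (hypothetical value)
expected_md5 <- "0123456789abcdef0123456789abcdef"

# the checksum of the file as it arrived on your machine (hypothetical path)
observed_md5 <- unname(
  tools::md5sum(here::here("data", "my_sequencing_file.fastq.gz"))
)

# TRUE means the file was transferred without changes
observed_md5 == expected_md5
```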
In the following example, we find the MD5sum for a .txt file in /data/md5_examples:
```r
library(tools)

md5_ex1_raw <- tools::md5sum(
  here::here("data", "md5_examples", "MD5_exampledata_1.txt")
)
```
Use `enframe()` to get atomic vectors or lists into a nice and tidy tibble:
md5_ex1_raw %>% enframe() -> md5sums_ex1
Click for the answer
```r
# library(tools)
myDir <- here::here("data", "md5_examples2")
fileNames <- list.files(myDir, recursive = TRUE)

tools::md5sum(file.path(myDir, fileNames)) %>%
  enframe() -> md5sums_all

md5sums_all$filename <- fileNames

md5sums_all %>%
  select(filename, value)
```
The example in the exercise above is an example of a so-called parameterized script, in this case a parameterized RMarkdown file. We will learn more about parameterizing RMarkdown files in lesson \@ref(rmarkdownparams).
- The covid Rmd is parameterized on date and country
- The script automatically includes the parameters in the title of the report and the captions of the figures
- The 'rendered' date is automatically set, for tracking and versioning purposes
- Parameterization can be used to automate reporting for many values of the parameters
- Further automation is now easy (although the ECDC regularly 'changes' the latest data available for download - and they do not use md5sums!! - which makes full automation and building in checks more difficult)
knitr::include_graphics(here::here( "images", "covid_rmd_screenshot.jpg" ) )
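To give an impression of how this works: a parameterized Rmd declares its parameters under `params:` in the YAML header, and you can then render the same report for different parameter values from R. A minimal sketch (the file name and parameter names are hypothetical):

```r
# render the same report for a specific country and date
# (assumes the Rmd declares `country` and `date` under params: in its YAML header)
rmarkdown::render(
  "covid_report.Rmd",
  params = list(country = "Netherlands", date = "2021-03-01"),
  output_file = "covid_report_netherlands_2021-03-01.html"
)
```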
Link stored data, to data in the analytics environment, to data in work products.
Read this paragraph header again and make sure you understand the difference between "stored data", "data in the analytics environment" and "data in work products".
Basically this is what you are doing with literate programming (e.g. RMarkdown) with R or Python in RStudio or Jupyter:
- The analytics environment is the Global Environment (where variables and R objects live)
- When you do data analysis, you should use code. See also Principle 4.
- When you write code, you should use Git, preferably in combination with Github. Or use another version control system.
- Hence: When you do data analysis, you should use Git & Github
- Git/Github is 'track-changes for code'
You will learn more on using the git/github workflow in data science in later lessons (\@ref(gitintro), \@ref(gitrstudio), \@ref(gitbranchmerge), \@ref(gitcollaboration)).
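If you work in RStudio, the `{usethis}` package offers helpers to get a project under version control; a minimal sketch:

```r
# put the current RStudio project under git version control
usethis::use_git()

# create a GitHub repository for it and push
# (requires a GitHub personal access token to be configured)
usethis::use_github()
```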
AKA communicate! When working together, it is vital to agree on how you will work. I hope the Guerrilla Analytics framework provides a starting point. Hopefully, you will learn during your projecticum work how vital this actually is when collaborating on a data project, or any project for that matter. Here are some pointers:
- Make guidelines on data management, storage places and workflows
- Agree within the team on them
- Stick to them! And be frank about it when things go wrong or people misbehave. An honest and open collaborative environment is encouraging. It is usually hard for people to change their way of working.
- Work together in a virtual collaboration environment.
- Work together on code using Github or some other version control based repository system (e.g. Gitlab / Bitbucket).
- Provide for education and share best practices within the organization, the department and/or the team (this is what we try to achieve with this course).
For iteration, use the `map()` family of functions from the `{purrr}` package. I prefer these over writing `for()` loops because they focus on the object that is being changed, not on the computation that is done.

Metadata is data about the data: for instance the type of variables, the number of observations, the experimental design and who gathered the data. This is quite often not reliably documented (or at least not easily accessible), but it is very important: data without context loses some of its purpose.
Take a look at this Wikipedia image of cocoa pods and scroll down. As you can see, Wikipedia stores a lot of metadata on file usage, licence, author, date, source, file size... Even the original metadata from the camera is included (scroll to the bottom).
Meta information for data files (like the type of variables, ranges, units, the number of observations or subjects in a study, the type of analysis or the experimental design) often goes in a README.txt file or in a sheet of the Excel file containing the data. Keep the readme information close to the data file. Information about who owns the data, or who performed the experiment, when, where and with what type of device or reagents, is also very useful to include. In our exercise above, such README information would have saved us a lot of time figuring out what is what in the Excel file, don't you think?
An example of a readme file is depicted below.
knitr::include_graphics( here::here( "images", "readme_updated.png"))
It does not need to be very long, but it should provide information on what the data refers to (which project?), who the owners are, who to contact in case of questions, and what the contents of the data are (variable descriptions).
Rmd files include a metadata section themselves: the YAML header. At the very least, specify the title, author and date here.
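A minimal YAML header could look like this (the values are placeholders):

```yaml
---
title: "Salmonella OD600 kinetics"
author: "Your Name"
date: "2021-03-01"
---
```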
At the very least, save the following in your README.txt:
Here is a nice example file by the University of Bath for bioinformatics projects, another more general template is available for download, and here is another template for experimental data.
This may seem to be a bit too much for your current projects, but try and see how much you could fill in and keep the template for future projects! Remember that any metadata is better than none.
This is an overview of all data files in a project. Keep an MS Excel file (called "data_log.xlsx") in the root of the folder `\data` of each project, and keep it up to date to track all the files present in this folder. Provide names and additional information here. Meta information for the `\data_raw` folder is best kept in that folder, in a `README.txt` file.
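A sketch of what such a data log could contain (the column names and entries below are only a suggestion):

```r
library(tibble)

# a possible structure for data_log.xlsx
data_log <- tribble(
  ~filename,                ~date_received, ~md5sum,       ~description,
  "plate_reader_run1.xlsx", "2020-10-08",   "0123abcd...", "OD600 kinetics in LB, run 1",
  "plate_layout.xlsx",      "2020-10-08",   "4567ef01...", "96-well plate layout"
)
data_log

# write it to the data folder, for example with {writexl}:
# writexl::write_xlsx(data_log, here::here("data", "data_log.xlsx"))
```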
Use versioning of data files. Decide on a versioning system for yourself (we showed you an example before, but you are free to look for other options), and stick to it.
Some metadata is most useful if you have it available in the code directly.
To reduce the effort of generating one complete tidy table for your data, it might be worthwhile to create a number of extra tables containing metadata. Typically, this is how it would work:
Assume we have data in a wide format, originally created in Excel, looking like this. (Actually, we will ask R to generate some data that looks like it was imported from Excel instead, because we didn't feel like copy-pasting 4x100 numbers to make an Excel file...)
```r
# generate some dummy data for the example
measured1 <- rbinom(100, size = 2, prob = 0.3)
measured2 <- rnorm(100, mean = 5.3, sd = 0.1)
measured3 <- rnbinom(100, size = 10, prob = 0.1)
concentration <- rep(1:10, 10)

# put it in a tibble
data <- tibble::tibble(
  `concentration (mMol/l)` = concentration,
  `measured 1 (pg/ml)` = measured1,
  `measured 2 (ng/ml)` = measured2,
  `measured 3 (ng/ml)` = measured3
)
data
```
The `r names(data)` are the variable names as provided in Excel. As you can see, they do not adhere to the aforementioned naming conventions (\@ref(namingconv)) in a few ways. The `r names(data)[2:ncol(data)]` refer to three variables that were determined in some experiment. The units of measurement are (as is common in Excel files) mentioned between brackets in the column names.
For compatibility and interoperability reasons, this data format can be improved to a more machine-readable format:
In this case, the unit information that is included in the variable names can be considered metadata, so you can put that information in a separate table. In the example below, I will call it `coldata` (short for column data).
First, we need to create a pivoted table where the first column represents the variable names of our `data` table. Then we need to add a row for each variable in our data. It is best if the variable names and the values in column 1 of the metadata table match exactly (in terms of spelling and typesetting). I will show how this looks for our `data`:
```r
var_names <- names(data)

metadata <- tibble::tibble(
  varnames = var_names
)
metadata
```
We now have a `metadata` table with one column called `varnames`. However, we are not done. If we want to create a tidy format of our metadata table, we need to separate the unit information from the variable names column. Let's extract the units into their own column:
```r
metadata %>%
  mutate(
    varnames = str_replace_all(varnames, pattern = " ", replacement = "")
  ) %>%
  separate(
    varnames,
    into = c("varnames", "units"),
    sep = "\\(",
    remove = FALSE
  ) %>%
  mutate(
    units = str_replace_all(units, pattern = "\\)", replacement = "")
  ) -> metadata_clean

metadata_clean
```
We can now start adding additional information, such as remarks or methods, to the metadata table.
```r
methods <- c("dilution", "elisa", "lcms", "flow cytometry")
remarks <- c(
  "concentration of exposure compound",
  "compound x is related to elevated blood pressure"
)
```
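A sketch of how such information could be attached to the metadata table (note that in this example only `methods` has one entry per variable; `remarks` would first need to be padded to the same length):

```r
# add a method for every variable in the metadata table
metadata_clean <- metadata_clean %>%
  mutate(method = methods)

metadata_clean
```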
The folder `\doc` contains documentation and can hold basically everything concerning information about the project, as long as it does not concern the data: for example, a PowerPoint presentation on the experimental design of a study, or a contract. Data information goes in the 'supporting' folder that is in the same folder as the data file it refers to.
Now that we have a framework with which we can build our workflows in a data science project, we can start working and collaborating. Below I summarize some key concepts that are useful when working in a data science team.
This information is not new, but DAUR1 was a while ago, so we'll repeat it: data can be formatted in different ways.
During the different R courses, we have frequently been working with data in the tidy format.
knitr::include_graphics( file.path(image_dir, "tidy-1.png") )
From: ["R for Data Science", Grolemund and Wickham](https://r4ds.had.co.nz/)
Although this is an optimized format for working with the `{tidyverse}` tools, it is not the only suitable data format. We have already encountered another important structure that is widely used in bioinformatics: `SummarizedExperiment`. This class of data format is optimized for working with Bioconductor packages and workflows.
knitr::include_graphics( here::here( "images", "summarizedexperiment.png" ))
Morgan M, Obenchain V, Hester J, Pagès H (2021). SummarizedExperiment: SummarizedExperiment container. R package version 1.22.0, https://bioconductor.org/packages/SummarizedExperiment.
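To give an idea of the structure: a `SummarizedExperiment` bundles one or more assay matrices with metadata about the rows (features) and the columns (samples). A minimal sketch (the data below is randomly generated; the package is installed from Bioconductor):

```r
# BiocManager::install("SummarizedExperiment")
library(SummarizedExperiment)

# a small random count matrix: 5 genes, 4 samples
counts <- matrix(
  rpois(20, lambda = 10),
  nrow = 5,
  dimnames = list(paste0("gene_", 1:5), paste0("sample_", 1:4))
)

# column (sample) metadata
coldata <- DataFrame(
  condition = c("control", "control", "treated", "treated"),
  row.names = colnames(counts)
)

se <- SummarizedExperiment(assays = list(counts = counts), colData = coldata)
se
```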
For machine learning purposes, data is often formatted in the wide format. We see an example here:
```r
data("BostonHousing", package = "mlbench")

BostonHousing %>% as_tibble()
```
The different variables are arranged in a side-by-side fashion. In this example the data is still tidy, but there are also examples of wide-formatted data that are not tidy. When you want to work with such data, you generally need to transform it into a stacked, or so-called long, format that works well with the `{tidyverse}`. We will see an example in the next exercise.
Click for the answer
```r
# reading in the data - without any special settings
library(readxl)
data_platereader <- read_xlsx(
  here::here("data", "salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"),
  sheet = "All Cycles"
)

## this data looks mangled because of several things:
# there is some metadata in the top region of the sheet
# there are weird looking headers (two headers?)

## trying skip
data_platereader <- read_xlsx(
  here::here("data", "salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"),
  sheet = "All Cycles",
  skip = 11
)

## clean up and fix names
data_platereader <- data_platereader %>%
  rename(sample = Time, well = ...1) %>%
  janitor::clean_names()

## which wells have data?
unique(data_platereader$well)

## create sample table
sample_names <- data_platereader$sample
mv_utr_tx100 <- rep(
  c("mv", "mv", "mv", "mv",
    "untr", "untr", "untr", "untr", "untr",
    "tx100", "tx100", "tx100"),
  times = 8
)

salmonella <- read_xlsx(
  here::here("data", "salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"),
  sheet = "layout",
  range = "C5:N13"
) %>%
  janitor::clean_names()

# check data types
map(.x = salmonella, typeof)

salmonella <- salmonella %>%
  pivot_longer(ul_sal_1:ul_sal_12,
               names_to = "plate_column",
               values_to = "microliters_bacteria")

## synthesize to sample table
samples <- tibble(
  well = data_platereader$well,
  sample = sample_names,
  condition = mv_utr_tx100,
  ul_salmonella = salmonella$microliters_bacteria
)

## join sample table with data
data_join <- left_join(samples, data_platereader)

## create tidy version
data_tidy <- data_join %>%
  pivot_longer(
    x0_h:x24_h_5_min,
    names_to = "time",
    values_to = "value"
  )

## fix time variable
data_tidy_time <- data_tidy %>%
  mutate(time_var = str_replace_all(string = time, pattern = "x", replacement = "")) %>%
  mutate(time_var = str_replace_all(string = time_var, pattern = "_", replacement = "")) %>%
  mutate(time_var = str_replace_all(string = time_var, pattern = "h", replacement = ":")) %>%
  mutate(time_var = str_replace_all(string = time_var, pattern = "min", replacement = "")) %>%
  separate(
    col = time_var,
    into = c("hours", "minutes"),
    remove = FALSE
  ) %>%
  mutate(minutes = ifelse(minutes == "", "0", minutes)) %>%
  mutate(minutes_passed = 60 * as.numeric(hours) + as.numeric(minutes))

## missingness
data_tidy %>% naniar::vis_miss()

## graphs
data_tidy_time %>%
  group_by(condition, ul_salmonella, minutes_passed) %>%
  summarise(mean_value = mean(value)) %>%
  mutate(ul_salmonella = round(as.numeric(ul_salmonella), 2)) %>%
  ggplot(aes(x = minutes_passed, y = mean_value)) +
  geom_line(aes(colour = condition), show.legend = FALSE) +
  facet_grid(condition ~ ul_salmonella) +
  xlab("Time passed (minutes)") +
  ylab("Mean AU")
```
Typical examples of variables that need to be encoded:

- `0`/`1`
- `sex`, `species`, `marital_status`, etc.
- `year`, `month`, etc.

Here we use the `{palmerpenguins}` dataset as an example to show you how they dealt with encoding variables.
[palmerpenguins](https://github.com/allisonhorst/palmerpenguins)
```r
# install.packages("remotes")
# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)

data_penguins <- palmerpenguins::penguins_raw
data_penguins
```
Make sure you are consistent in entering the data!
```r
library(ggplot2)

# simulating inconsistent data entry
penguinswrong <- penguins
levels(penguinswrong$species) <- c(levels(penguinswrong$species), "adelie")
penguinswrong$species[1:5] <- "adelie"

# make a box plot of flipper length showing a/Adelie as separate species
flipper_box <- ggplot(data = penguinswrong, aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(aes(color = species), width = 0.3, show.legend = FALSE) +
  geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
              position = position_jitter(width = 0.2, seed = 0)) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4", "red")) +
  theme_minimal() +
  labs(x = "Species", y = "Flipper length (mm)")

flipper_box
```
R (unlike SPSS) does not mind if you use descriptive words instead of numbers as categorical variable values. This increases reproducibility! ggplot2 doesn't mind either. (Machine learning workflows may mind, but we're not doing machine learning here.) The different possible options for such a variable are called the levels of this factor:
```r
data_penguins %>%
  ggplot(aes(x = Sex, y = `Flipper Length (mm)`)) +
  geom_point(aes(colour = Species), position = "jitter", show.legend = FALSE)

unique(data_penguins$Sex) ## we call these factor levels
```
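As a small illustration (hypothetical data), this is how a character vector is encoded as a factor with descriptive levels:

```r
# a character vector encoded as a factor with descriptive level names
condition <- factor(
  c("untreated", "treated", "treated", "untreated"),
  levels = c("untreated", "treated")
)
levels(condition)
```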
When we store data for re-use, we need it to be in an interoperable form. This means that it can be read into analysis software, also after a long time (let's say 30 years from now). This can be achieved by storing data in a so-called non-proprietary format, which basically means that the format's source code is open and maintained by an open source community or core development team.
Here are some examples:

- `.netCDF` (Geo, proteomics, array-oriented scientific data)
- `.xml`/`.mzXML` (markup language, human and machine readable, metadata + data together)
- `.txt`/`.csv` (flat text file, usually tab, comma or semicolon (`;`) separated)
- `.json` (text format that is completely language independent)
- `fastq`/`fasta` and their equivalents
These formats will remain readable, even if the format itself becomes obsolete. When storing a curated dataset for sharing or archiving, it is always better, and sometimes enforced by the repository, to choose a non-proprietary format.
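For example, a curated version of the penguin data from above could be stored as a flat text file (the file name is just a suggestion):

```r
# store the curated dataset in a non-proprietary, flat text format
readr::write_csv(data_penguins, here::here("data", "penguins_curated.csv"))
```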
Data entry should preferably be performed in a project template. The template contains predefined information on the observations in the study. The blank information needs to be filled out by the person responsible for, or performing, the data entry. In other words: think about how you will enter your data before gathering it, and if multiple people are gathering the data, make sure that everyone uses exactly the same way of entering the data (the template).
Enter an “NA” for missing values, do not leave cells blank if there is a missing value. Use only “NA” and nothing else. If you want to add additional information on the “NA”, put that in the “remarks” column.
(By the way, you can visualise missing data in R with the {naniar} package, like this:)
naniar::vis_miss(data_penguins)
Or check out a ggplot method here.
After entry (and validation) of the filled-out template, NEVER change a value in the data. If you want to make changes, increment the version number of the file and document the change in the README.txt file or in a README sheet in the Excel file (see below).
A tidy data template you may want to use is available here.
When you are planning to use this template, please be aware of the following pointers:
For Excel users:
Complete exercises 3a and 3b of the portfolio assignments.
CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Unless it was borrowed (there will be a link), in which case, please use their license.