```r
les <- 3
load("course_urls.RData")
knitr::opts_chunk$set(echo = TRUE, class.source = "Rchunk", class.output = "Rout")
```
Picture by Hamid Hajihusseini, CC BY 3.0, via Wikimedia Commons
```r
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  fig.width = 6,
  fig.height = 4
)
image_dir <- here::here("images")
```
```r
## packages
library(tidyverse)
library(readxl)
```
After this lesson you will be familiar with agile working, and with the most important data management guidelines and principles.
Do you recognize this?
knitr::include_graphics( file.path( image_dir, "final_final.png" ) )
Source: https://medium.com/@jameshoareid/final-pdf-finalfinal-pdf-actualfinal-pdf-cae61ab1d94c
And this?
knitr::include_graphics( file.path( image_dir, "final.jpg" ) )
The use of version control eliminates the need to invent a new file name every time you save a file. You will learn more about using version control (git and github.com) in lessons 2 and 3. With git version control you only have to think about naming a file once, with a good name. But what makes a 'good' file name?
A good file name uses underscores (`_`) to separate words and a dot (`.`) only before the extension. Having multiple dots (`.`) in a file name can be confusing but sometimes is required; for example, for an archive we sometimes see `<file_name>.tar.gz`. Some people use UpperFirst or camelCase instead of underscores. The special characters you should avoid in a file name:
! @ # $ % ^ & * ( ) + - = : " ' ; ? > < ~ / { } [ ] \ | ` ,
Special characters are reserved for other purposes and can cause problems when files are backed up, copied, or loaded into analysis software.
Basically, what was stated about file names also applies to naming variables in a dataset. In other words: it applies to choosing or creating valid names for columns in a data frame, or names of R objects for that matter.
knitr::include_graphics( here::here( "images", "bad-characters.png" ) )
Below, I show an example of a badly formatted file name and badly formatted column names to make the point.
knitr::include_graphics( file.path(image_dir, "bad_formatting_file_name_and_headers.jpg") )
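If you do end up with messy column names like these, they do not have to stay that way. Below is a small sketch (the tibble is made up for illustration) of how the `{janitor}` package can clean them up; we will use `janitor::clean_names()` again later in this lesson.

```r
library(tidyverse)
library(janitor)

# a made-up tibble with badly formatted column names
messy <- tibble(
  `Measured 1 (pg/ml)` = c(0.1, 0.4),
  `% viable cells!`    = c(88, 92)
)

# clean_names() converts the names to lower case snake_case
messy %>% clean_names() %>% names()
```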
To help you build a thorough data management process for yourself, one that you can start using and expand when needed, we need a framework. In this course we use the Guerrilla Analytics framework, as described by Enda Ridge in this booklet. If you are pursuing a career in Data Science, I highly recommend getting a copy! The booklet describes, in a very practical and hands-on way, how to establish a data management workflow. Whether you work all by yourself or in a team, the pointers in this book will be applicable. Once you get to know this framework, you will discover that you used to do it wrong (like me...).
To build the framework, let's look at the 7 core Guerrilla Analytics principles:
- Space is cheap, confusion is expensive
- Use simple, visual project structures and conventions
- Automate (everything - my addition) with program code
- Link stored data to data in the analytics environment to data in work products (literate programming with RMarkdown - my addition)
- Version control changes to data and analytics code (Git/Github.com)
- Consolidate team knowledge (agree on guidelines and stick to it as a team)
- Use code that runs from start to finish (literate programming with RMarkdown - my addition)
[Guerrilla Analytics book by Enda Ridge](https://guerrilla-analytics.net/)
knitr::include_graphics( here::here("images", "guerrilla_analytics.jpg") )
As you can see from my own edits to these principles, quite a few are immediately applicable when using a programming language like R and the Rmd format, which we use all the time. I will go over each principle below in more detail. But first, an exercise.
Now that we have encountered a data management challenge in the wild, let's build our framework so that we can tackle these types of problems in a more structured fashion the next time we meet them.
This principle is simple: storage costs are low these days, so there is no need to spend a lot of time on file administration.
knitr::include_graphics( file.path( image_dir, "docking_station_harddisk.jpg" ) )
When you receive files from a laboratory that has performed a sequencing analysis, they are usually in `.fastq.gz` or `.fasta.gz` format. Because these files can be big, they are usually accompanied by a small file containing a hash-like string looking something like this:
```r
tools::md5sum(here::here("data", "tidy_Kopie van salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"))
```
This is the md5sum checksum for the file used earlier: `./data/salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx`. When the file changes, the checksum also changes, as we will see in the following exercise.
There are a number of different algorithms with which we can calculate such so-called checksums. Here we use md5sums, produced by MD5, a popular hashing algorithm. md5sums are, in effect, fingerprints of a file's contents.
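In practice you compare the md5sum you compute locally with the one that was shipped with the file. A minimal sketch (the path and the expected value below are hypothetical):

```r
# the checksum you received alongside the data file (hypothetical value)
expected_md5 <- "0123456789abcdef0123456789abcdef"

# the checksum of the file as it arrived on your machine (hypothetical path)
observed_md5 <- unname(
  tools::md5sum(here::here("data", "my_sequencing_file.fastq.gz"))
)

# TRUE means the file was transferred without changes
observed_md5 == expected_md5
```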
In the following example, we find the MD5sum for a .txt file in /data/md5_examples:
```r
library(tools)

md5_ex1_raw <- tools::md5sum(
  here::here("data", "md5_examples", "MD5_exampledata_1.txt")
)
```
Use `enframe()` to get atomic vectors or lists into a nice and tidy tibble:
md5_ex1_raw %>% enframe() -> md5sums_ex1
Click for the answer
```r
# library(tools)
myDir <- here::here("data", "md5_examples2")
fileNames <- list.files(myDir, recursive = TRUE)

tools::md5sum(file.path(myDir, fileNames)) %>%
  enframe() -> md5sums_all

md5sums_all$filename <- fileNames

md5sums_all %>%
  select(filename, value)
```
The example in the exercise above is an example of a so-called parameterized script, in this case a parameterized RMarkdown file. We will learn more about parameterizing RMarkdown files in lesson \@ref(rmarkdownparams).
- The covid Rmd is parameterized on date and country
- The script automatically includes the parameters in the title of the report and the captions of the figures
- The 'rendered' date is automatically set, for tracking and versioning purposes
- Parameterization can be used to automate reporting for many values of the parameters
- Further automation is now easy (although the ECDC regularly 'changes' the latest data available for download - and they do not use md5sums!! - which makes full automation and building in checks more difficult)
knitr::include_graphics(here::here( "images", "covid_rmd_screenshot.jpg" ) )
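To give an impression of how this works: a parameterized Rmd declares its parameters under `params:` in the YAML header, and you can then render the same report for different parameter values from R. A minimal sketch (the file name and parameter names are hypothetical):

```r
# render the same report for a specific country and date
# (assumes the Rmd declares `country` and `date` under params: in its YAML header)
rmarkdown::render(
  "covid_report.Rmd",
  params = list(country = "Netherlands", date = "2021-03-01"),
  output_file = "covid_report_netherlands_2021-03-01.html"
)
```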
Link stored data, to data in the analytics environment, to data in work products.
Read this paragraph header again and make sure you understand the difference between "stored data", "data in the analytics environment" and "data in work products".
Basically this is what you are doing with literate programming (e.g. RMarkdown) with R or Python in RStudio or Jupyter:
- The analytics environment is the Global Environment (where variables and R objects live)
- When you do data analysis, you should use code. See also Principle 4.
- When you write code, you should use Git, preferably in combination with Github. Or use another version control system.
- Hence: When you do data analysis, you should use Git & Github
- Git/Github is 'track-changes for code'
You will learn more on using the git/github workflow in data science in later lessons (\@ref(gitintro), \@ref(gitrstudio), \@ref(gitbranchmerge), \@ref(gitcollaboration)).
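If you work in RStudio, the `{usethis}` package offers helpers to get a project under version control; a minimal sketch:

```r
# put the current RStudio project under git version control
usethis::use_git()

# create a GitHub repository for it and push
# (requires a GitHub personal access token to be configured)
usethis::use_github()
```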
AKA communicate! When working together, it is vital to agree on how you will work. I hope the Guerrilla Analytics framework provides a starting point. Hopefully, you will learn during your projecticum work how vital this actually is when collaborating on a data project, or any project for that matter. Here are some pointers:
- Make guidelines on data management, storage places and workflows
- Agree within the team on them
- Stick to them! And be frank about it when things go wrong or people misbehave. An honest and open collaborative environment is encouraging. It is usually hard for people to change their way of working.
- Work together in a virtual collaboration environment.
- Work together on code using Github or some other version control based repository system (e.g. Gitlab / Bitbucket).
- Provide for education and share best practices within the organization, the department and/or the team (this is what we try to achieve with this course).
For iteration, use the `map()` family of functions from the `{purrr}` package. I prefer these over writing `for()` loops because they focus on the object that is being changed, not on the computation that is done.

Metadata is data about the data: for instance the type of variables, the number of observations, the experimental design and who gathered the data. This is quite often not reliably documented (or at least not easily accessible), but it is very important: data without context loses some of its purpose.
Take a look at this Wikipedia image of cocoa pods and scroll down. As you can see, Wikipedia stores a lot of metadata on file usage, licence, author, date, source, file size... Even the original metadata from the camera is included (scroll to the bottom).
Meta information for data files (like the type of variables, ranges, units, the number of observations or subjects in a study, the type of analysis or the experimental design) often goes in a README.txt file or in a sheet of the Excel file containing the data. Keep the readme information close to the data file. Information about who owns the data, or who performed the experiment, when, where and with what type of device or reagents, is also very useful to include. In our exercise above, such README information would have saved us a lot of time figuring out what is what in the Excel file, don't you think?
An example of a readme file is depicted below.
knitr::include_graphics( here::here( "images", "readme_updated.png"))
It does not need to be very long, but it should provide information on what the data refers to (which project?), who the owners are, who to contact in case of questions, and what the contents of the data are (variable descriptions).
Rmd files include a metadata section themselves: the YAML header. At the very least, specify the title, author and date here.
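A minimal YAML header could look like this (the values are placeholders):

```yaml
---
title: "Salmonella OD600 kinetics"
author: "Your Name"
date: "2021-03-01"
---
```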
At the very least, save the following in your README.txt:
Here is a nice example file by the University of Bath for bioinformatics projects, another more general template is available for download, and here is another template for experimental data.
This may seem to be a bit too much for your current projects, but try and see how much you could fill in and keep the template for future projects! Remember that any metadata is better than none.
This is an overview of all data files in a project. Keep an MS Excel file (called "data_log.xlsx") in the root of the folder `\data` of each project, and keep it up to date to track all the files present in this folder. Provide names and additional information here. Meta information for the `\data_raw` folder is best kept in that folder, in a `README.txt` file.
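A sketch of what such a data log could contain (the column names and entries below are only a suggestion):

```r
library(tibble)

# a possible structure for data_log.xlsx
data_log <- tribble(
  ~filename,                ~date_received, ~md5sum,       ~description,
  "plate_reader_run1.xlsx", "2020-10-08",   "0123abcd...", "OD600 kinetics in LB, run 1",
  "plate_layout.xlsx",      "2020-10-08",   "4567ef01...", "96-well plate layout"
)
data_log

# write it to the data folder, for example with {writexl}:
# writexl::write_xlsx(data_log, here::here("data", "data_log.xlsx"))
```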
Use versioning of data files. Decide on a versioning system for yourself (we showed you an example before, but you are free to look for other options), and stick to it.
Some metadata is most useful if you have it available in the code directly.
To reduce the effort of generating one complete tidy table for your data, it might be worthwhile to create a number of extra tables containing metadata. Typically, this is how it would work:
Assume we have data in a wide format, originally created in Excel, looking like this. (Actually, we will ask R to generate some data that looks like it was imported from Excel instead, because we didn't feel like copy-pasting 4x100 numbers to make an Excel file...)
```r
# generate some dummy data for the example
measured1 <- rbinom(100, size = 2, prob = 0.3)
measured2 <- rnorm(100, mean = 5.3, sd = 0.1)
measured3 <- rnbinom(100, size = 10, prob = 0.1)
concentration <- rep(1:10, 10)

# put it in a tibble
data <- tibble::tibble(
  `concentration (mMol/l)` = concentration,
  `measured 1 (pg/ml)` = measured1,
  `measured 2 (ng/ml)` = measured2,
  `measured 3 (ng/ml)` = measured3
)
data
```
The `r names(data)` are the variable names as provided in Excel. As you can see, they do not adhere to the aforementioned naming conventions (\@ref(namingconv)) in a few ways. The `r names(data)[2:ncol(data)]` refer to three variables that were determined in some experiment. The units of measurement are (as is common in Excel files) mentioned between brackets in the column names.
For compatibility and interoperability reasons, this data format can be improved to a more machine-readable format:
In this case, the unit information that is included in the variable names can be considered metadata, so you can put that information in a separate table. In the example below, I will call it `coldata` (short for column data).
First, we need to create a pivoted table where the first column represents the variable names of our `data` table. Then we need to add a row for each variable in our data. It is best if the variable names and the values in column 1 of the metadata table match exactly (in terms of spelling and typesetting). I will show how this looks for our `data`:
```r
var_names <- names(data)

metadata <- tibble::tibble(
  varnames = var_names
)
metadata
```
We now have a `metadata` table with one column called `varnames`. However, we are not done. If we want to create a tidy format of our metadata table, we need to separate the unit information from the variable names column. Let's extract the units into their own column:
```r
metadata %>%
  mutate(
    varnames = str_replace_all(varnames, pattern = " ", replacement = "")
  ) %>%
  separate(
    varnames,
    into = c("varnames", "units"),
    sep = "\\(",
    remove = FALSE
  ) %>%
  mutate(
    units = str_replace_all(units, pattern = "\\)", replacement = "")
  ) -> metadata_clean

metadata_clean
```
We can now start adding additional information, such as remarks or methods, to the metadata table.
```r
methods <- c("dilution", "elisa", "lcms", "flow cytometry")
remarks <- c(
  "concentration of exposure compound",
  "compound x is related to elevated blood pressure"
)
```
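A sketch of how such information could be attached to the metadata table (note that in this example only `methods` has one entry per variable; `remarks` would first need to be padded to the same length):

```r
# add a method for every variable in the metadata table
metadata_clean <- metadata_clean %>%
  mutate(method = methods)

metadata_clean
```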
The folder `\doc` contains documentation and can hold basically everything concerning information about the project, as long as it does not concern the data: for example, a PowerPoint presentation on the experimental design of a study, or a contract. Data information goes in the 'supporting' folder that is in the same folder as the data file it refers to.
Now that we have a framework with which we can build our workflows in a data science project, we can start working and collaborating. Below I summarize some key concepts that are useful when working in a data science team.
This information is not new, but DAUR1 was a while ago, so we'll repeat it: data can be formatted in different ways.
During the different R courses, we have frequently been working with data in the tidy format.
knitr::include_graphics( file.path(image_dir, "tidy-1.png") )
From: ["R for Data Science", Grolemund and Wickham](https://r4ds.had.co.nz/)
Although this is an optimized format for working with the `{tidyverse}` tools, it is not the only suitable data format. We have already encountered another important structure that is widely used in bioinformatics: `SummarizedExperiment`. This class of data format is optimized for working with Bioconductor packages and workflows.
knitr::include_graphics( here::here( "images", "summarizedexperiment.png" ))
Morgan M, Obenchain V, Hester J, Pagès H (2021). SummarizedExperiment: SummarizedExperiment container. R package version 1.22.0, https://bioconductor.org/packages/SummarizedExperiment.
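To give an idea of the structure: a `SummarizedExperiment` bundles one or more assay matrices with metadata about the rows (features) and the columns (samples). A minimal sketch (the data below is randomly generated; the package is installed from Bioconductor):

```r
# BiocManager::install("SummarizedExperiment")
library(SummarizedExperiment)

# a small random count matrix: 5 genes, 4 samples
counts <- matrix(
  rpois(20, lambda = 10),
  nrow = 5,
  dimnames = list(paste0("gene_", 1:5), paste0("sample_", 1:4))
)

# column (sample) metadata
coldata <- DataFrame(
  condition = c("control", "control", "treated", "treated"),
  row.names = colnames(counts)
)

se <- SummarizedExperiment(assays = list(counts = counts), colData = coldata)
se
```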
For machine learning purposes, data is often formatted in the wide format. We see an example here:
```r
data("BostonHousing", package = "mlbench")

BostonHousing %>% as_tibble()
```
The different variables are arranged in a side-by-side fashion. In this example the data is still tidy, but there are also examples of wide-formatted data that are not tidy. When you want to work with such data, you generally need to transform it into a stacked, or so-called long, format that works well with the `{tidyverse}`. We will see an example in the next exercise.
Click for the answer
```r
# reading in the data - without any special settings
library(readxl)
data_platereader <- read_xlsx(
  here::here("data", "salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"),
  sheet = "All Cycles"
)

## this data looks mangled because of several things:
# there is some metadata in the top region of the sheet
# there are weird looking headers (two headers?)

## trying skip
data_platereader <- read_xlsx(
  here::here("data", "salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"),
  sheet = "All Cycles",
  skip = 11
)

## clean up and fix names
data_platereader <- data_platereader %>%
  rename(sample = Time, well = ...1) %>%
  janitor::clean_names()

## which wells have data?
unique(data_platereader$well)

## create sample table
sample_names <- data_platereader$sample
mv_utr_tx100 <- rep(
  c("mv", "mv", "mv", "mv",
    "untr", "untr", "untr", "untr", "untr",
    "tx100", "tx100", "tx100"),
  times = 8
)

salmonella <- read_xlsx(
  here::here("data", "salmonella CFU kinetics OD600 in LB van ipecs 8okt2020 kleur.xlsx"),
  sheet = "layout",
  range = "C5:N13"
) %>%
  janitor::clean_names()

# check data types
map(.x = salmonella, typeof)

salmonella <- salmonella %>%
  pivot_longer(ul_sal_1:ul_sal_12,
               names_to = "plate_column",
               values_to = "microliters_bacteria")

## synthesize to sample table
samples <- tibble(
  well = data_platereader$well,
  sample = sample_names,
  condition = mv_utr_tx100,
  ul_salmonella = salmonella$microliters_bacteria
)

## join sample table with data
data_join <- left_join(samples, data_platereader)

## create tidy version
data_tidy <- data_join %>%
  pivot_longer(
    x0_h:x24_h_5_min,
    names_to = "time",
    values_to = "value"
  )

## fix time variable
data_tidy_time <- data_tidy %>%
  mutate(time_var = str_replace_all(string = time, pattern = "x", replacement = "")) %>%
  mutate(time_var = str_replace_all(string = time_var, pattern = "_", replacement = "")) %>%
  mutate(time_var = str_replace_all(string = time_var, pattern = "h", replacement = ":")) %>%
  mutate(time_var = str_replace_all(string = time_var, pattern = "min", replacement = "")) %>%
  separate(
    col = time_var,
    into = c("hours", "minutes"),
    remove = FALSE
  ) %>%
  mutate(minutes = ifelse(minutes == "", "0", minutes)) %>%
  mutate(minutes_passed = 60 * as.numeric(hours) + as.numeric(minutes))

## missingness
data_tidy %>% naniar::vis_miss()

## graphs
data_tidy_time %>%
  group_by(condition, ul_salmonella, minutes_passed) %>%
  summarise(mean_value = mean(value)) %>%
  mutate(ul_salmonella = round(as.numeric(ul_salmonella), 2)) %>%
  ggplot(aes(x = minutes_passed, y = mean_value)) +
  geom_line(aes(colour = condition), show.legend = FALSE) +
  facet_grid(condition ~ ul_salmonella) +
  xlab("Time passed (minutes)") +
  ylab("Mean AU")
```
Typical examples of variables that need to be encoded:

- `0`/`1`
- `sex`, `species`, `marital_status`, etc.
- `year`, `month`, etc.

Here we use the `{palmerpenguins}` dataset as an example to show you how they dealt with encoding variables.
[palmerpenguins](https://github.com/allisonhorst/palmerpenguins)
```r
# install.packages("remotes")
# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)

data_penguins <- palmerpenguins::penguins_raw
data_penguins
```
Make sure you are consistent in entering the data!
```r
library(ggplot2)

# simulating inconsistent data entry
penguinswrong <- penguins
levels(penguinswrong$species) <- c(levels(penguinswrong$species), "adelie")
penguinswrong$species[1:5] <- "adelie"

# make a box plot of flipper length showing a/Adelie as separate species
flipper_box <- ggplot(data = penguinswrong, aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(aes(color = species), width = 0.3, show.legend = FALSE) +
  geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
              position = position_jitter(width = 0.2, seed = 0)) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4", "red")) +
  theme_minimal() +
  labs(x = "Species", y = "Flipper length (mm)")

flipper_box
```
R (unlike SPSS) does not mind if you use descriptive words instead of numbers as categorical variable values. This increases reproducibility! ggplot2 doesn't mind either. (Machine learning workflows may mind, but we're not doing machine learning here.) The different possible options for such a variable are called the levels of this factor:
```r
data_penguins %>%
  ggplot(aes(x = Sex, y = `Flipper Length (mm)`)) +
  geom_point(aes(colour = Species), position = "jitter", show.legend = FALSE)

unique(data_penguins$Sex) ## we call these factor levels
```
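As a small illustration (hypothetical data), this is how a character vector is encoded as a factor with descriptive levels:

```r
# a character vector encoded as a factor with descriptive level names
condition <- factor(
  c("untreated", "treated", "treated", "untreated"),
  levels = c("untreated", "treated")
)
levels(condition)
```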
When we store data for re-use, we need it to be in an interoperable form. This means that it can be read into analysis software, also after a long time (let's say 30 years from now). This can be achieved by storing data in a so-called non-proprietary format, which basically means that the format's source code is open and maintained by an open source community or core development team.
Here are some examples:

- `.netCDF` (Geo, proteomics, array-oriented scientific data)
- `.xml`/`.mzXML` (markup language, human and machine readable, metadata + data together)
- `.txt`/`.csv` (flat text file, usually tab, comma or semicolon (`;`) separated)
- `.json` (text format that is completely language independent)
- `fastq`/`fasta` and their equivalents
These formats will remain readable, even if the format itself becomes obsolete. When storing a curated dataset for sharing or archiving, it is always better, and sometimes enforced by the repository, to choose a non-proprietary format.
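For example, a curated version of the penguin data from above could be stored as a flat text file (the file name is just a suggestion):

```r
# store the curated dataset in a non-proprietary, flat text format
readr::write_csv(data_penguins, here::here("data", "penguins_curated.csv"))
```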
Data entry should preferably be performed in a project template. The template contains predefined information on the observations in the study. The blank information needs to be filled out by the person responsible for, or performing, the data entry. In other words: think about how you will enter your data before gathering it, and if multiple people are gathering the data, make sure that everyone uses exactly the same way of entering the data (the template).
Enter an “NA” for missing values, do not leave cells blank if there is a missing value. Use only “NA” and nothing else. If you want to add additional information on the “NA”, put that in the “remarks” column.
(By the way, you can visualise missing data in R with the {naniar} package, like this:)
naniar::vis_miss(data_penguins)
Or check out a ggplot method here.
After entry (and validation) of the filled-out template, NEVER change a value in the data. If you want to make changes, increment the version number of the file and document the change in the README.txt file or in a README sheet in the Excel file (see below).
A tidy data template you may want to use is available here.
When you are planning to use this template, please be aware of the following pointers:
For Excel users:
Complete exercises 3a and 3b of the portfolio assignments.
CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Unless it was borrowed (there will be a link), in which case, please use their license.