"I just wasted five hours running an analysis on the wrong input file."
"I can't remember where I saved my output."
"I accidentally overwrote my raw data."
Many of us have probably said one or more of the above sentences. When you're balancing multiple projects and lots of data, it can be easy to lose track of files. One of the biggest challenges to any project is placing project files in a structure that is easy for you, the scientist, to access and maintain.
This chapter will cover the basics of project organization and data organization and management. At the end of this chapter, you should be familiar with how to structure a project directory and how to keep your raw data safe.
This book will discuss not solely any one computing library, but efficient computation for phylogenetic analyses in R. In this section, we are going to address a very specific aim: setting up projects on your computer in a way that will allow you to manage them efficiently. We will use hands-on examples throughout this section.
When managing a project, you want to keep its files together as much as possible. Doing this makes it easier to document your project and to avoid losing data files.
Create an R Project together
Make four subdirectories: scripts, data, output and documentation
These four subdirectories will all serve important purposes. Our scripts directory will house the R code we will write. We will talk in further chapters about why it is important and useful to keep all of the scripts for a single project together. For now, just know that it is.
Our data directory will house our raw data. We want our raw data to be housed on its own. If we make a mistake in how we save output files, we can always go back to our raw data and run the analysis again ... so long as we have maintained the integrity of our raw data. Once we have populated our data directory with the necessary data, we do not write to it. A simple motto for this philosophy is that data are read-only.
We do, however, write to our output directory. We will talk in future chapters about using R to make readable and informative output file names. For now, just know that the results of any analysis go in the output directory.
Documentation is where we can put miscellaneous, informative files. For example, papers that you are reading for your project, cost sheets for sequencing, talk slides.
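If you prefer to stay inside R, the same structure can be created with base R's dir.create(). This is a minimal sketch, assuming your working directory is the root of your new R Project:
```
# Create the four project subdirectories from R.
# showWarnings = FALSE keeps R quiet if a directory already exists.
dir.create("scripts", showWarnings = FALSE)
dir.create("data", showWarnings = FALSE)
dir.create("output", showWarnings = FALSE)
dir.create("documentation", showWarnings = FALSE)
```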
This is the basic setup all chapters of this book will rely on. This manner of file organization is very transparent: anyone looking at your file directory can understand which components of your research project are stored where. If you follow this structure, when you go to publish your paper, you can simply archive your entire project directory to meet most funders' and journals' guidelines for providing data and code. And who couldn't use a little less on their plate when it comes time to submit a paper?
So far, we have introduced two important concepts: keep files together as much as possible and data are read-only. The first principle means to keep data, output and code related to one project in one place on your computer to increase your organization. The second principle means to treat raw data as untouchable. Any outputs of analyses should be kept separate to avoid loss of data, and to ensure you can rerun any steps as needed.
In learning these two concepts, we have used hands-on commands: mkdir to create directories, ls to look at their contents and cd to navigate. We also learned to check our navigation with the pwd command and to use tab-complete to increase the efficiency of our typing.
"I know I made a plot of this ... I just can't remember how."
"This looks totally different on my computer."
"My coworker doesn't have Excel."
Almost everyone has used a spreadsheet program to manage data at some point in their career. These graphical programs offer a clean interface to view and manage the data we work so hard to collect. But they can also introduce problems into biological workflows.
There are three main issues with using Excel and other similar programs to carry out computations: the files are saved in a binary format that carries a lot of information beyond the data themselves; how the data are displayed and calculated can differ between software versions and platforms; and point-and-click operations leave no record of what was done, making analyses hard to reproduce.
In this chapter, we will discuss moving data management and analysis beyond the traditional spreadsheet paradigm. Along the way, we will learn how to store data in a way that is both human- and machine-readable. Using a test dataset on ant (Formicidae) taxonomy, we will import datasets into R and begin to explore them programmatically. At the end of this chapter, you will be familiar with flat files and with loading and exploring data in R.
When you save a file in Excel, you don't simply save the data. You save, encoded in binary, information about cell positioning, coloring, and other document attributes. In the previous section, we mentioned that the behavior of a file loaded into a spreadsheet viewing program can vary between versions of the software. This is not because the data are changing. This is because there are often subtle changes to how the data are displayed or how statistics are calculated. These subtle changes can cause dramatically different renderings of files across versions and platforms.
All of this extraneous information also makes the file harder to read at the command line, or in R. For example, in the data directory, you will find two data files: Ants.csv and Ants.xlsx. Try looking at each in a plain text editor.
Which of these files are you able to visualize?
The Ants.csv file is what is called a flat file. Flat files are plain text: they don't contain any characters that can't be typed with a keyboard or viewed at the command line. Any fancy formatting in a plain text file comes from the viewer. For example, this book is written in plain text; the nice formatting you see is the result of rendering software. If you were to view the source file at the command line, you would still be able to access all of the information within it. Because flat files can be read by both human and machine, they are often considered preferable to files with extensive encoding of extra information.
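You can also peek at the raw text of a flat file without leaving R. This is a quick check using base R's readLines; the path assumes you are working from the project root described earlier:
```
# Show the first few lines of the CSV exactly as they are stored on disk
readLines("data/Ants.csv", n = 3)
```
Trying the same thing on Ants.xlsx will print mostly unreadable characters (and likely a warning or two), which is exactly the point.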
For the purposes of this lesson, we will show you how to get data out of spreadsheet files (such as Excel files), but we will predominantly be describing how you can avoid using these file types.
Inside each flat file, data should be organized with each row being an observation, and each column being the variables observed. If you open the Ants.csv file, you will note that each row corresponds to one fossil ant. Each column is labeled as a variable observed about the ant - taxonomy, age of the fossil and notes. Most programmatic ways of handling data assume this structure.
You will also notice that all the column names have underscores, rather than spaces. Most programming languages will assume that a space indicates the end of a name, so it's best to avoid spaces. Lastly, notice that the data directory also contains a README. This README tells the user what data files they should expect to find when they download your data.
There are many ways to process data at the command line. These range from simple UNIX utilities to elaborate libraries in R. In this section, we'll get to know R's data-handling tools a little better, and use them as a gateway to getting familiar with the language. Once we have learned how to carry out some common spreadsheet operations, we will discuss R programming more generally in the next chapter. This is by no means an exhaustive look at R's data-handling functions. It is simply a teaser to show some useful ways that you might interact with data in R.
Base R and the tidyverse collection of packages provide efficient data processing, manipulation and plotting. The fundamental data structure for tabular data in R is the dataframe, an object containing rows and columns of data. For our Ants.csv data, the rows will be the fossil ants and the columns will be the observations (fossil minimum ages, maximum ages, and taxonomy). It's instructive to look at an example.
?read.csv
By prefixing the function name with ?, we asked R for help on the read.csv function, which is part of base R. We haven't provided it with any data to read; R simply showed us the documentation describing how the function parses text files.
Now we will use this function to load some data into R:
ant_data <- read.csv('data/Ants.csv')
ant_data
What we see is the data. We are able to call up the data to view because they have been saved to a variable, ant_data, which was stored in memory as a dataframe when we called read.csv.
Challenge:
Try this command again without saving it to a variable.
- What happens?
- Is it what you expected?
- Why did this happen?
Let's take a closer look at the ant_data object. On the face of it, it looks very much like our Excel file. Now we'll explore how to access data in this dataframe. Let's start by getting a look at all the specimen names in the file. There are a couple of ways we can do this:
ant_data$specimen
or
ant_data[[1]]
In our first block of code, we call the column by its name, specimen, using the $ operator. This is a very natural way to access the data: we gave the attributes of our data names, so why not use them?
In our second block of code, we use what is called indexing to get the first column. The double brackets indicate that we are accessing a single column by its position. Note that, unlike some languages that start counting at zero, R starts counting at one, so [[1]] refers to the first column.
Challenge:
Try accessing other columns.
- When will it be useful to access by name?
- When will it not?
We can also select multiple columns at once using our column names.
library(tidyverse)
ant_data %>% select(specimen, tribe)
Selecting data by position is also possible. The code below grabs all rows of the fifth and sixth columns of the dataframe:
ant_data[5:6]
This introduces a couple of interesting concepts. The first is the colon operator, which builds a sequence of integers: 5:6 expands to 5 and 6, with both endpoints included. The second is that, when you index a dataframe with single square brackets and a vector of numbers, you are selecting whole columns, so we get every row of the fifth and sixth columns.
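If you want to convince yourself of what the colon operator produces, you can evaluate the sequence on its own:
```
# The colon operator builds an integer sequence, endpoints included
5:6
# [1] 5 6
```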
In this chapter, we have discussed data storage using flat files, which can be read by both computers and humans. We introduced R packages and learned how we can call functions from a package to do useful tasks for us. And finally, we have started to use a small set of functions from base R and the tidyverse to access data programmatically. In the next chapter, we will build on these concepts to perform a wider range of data tasks.
When would we not want to use lists of columns and rows to select data out of a dataframe? When we have a lot of data, which is rapidly becoming the norm in biology. In order to process data at a large scale, we need better tools that allow us to do complex data filtering, selection and manipulation. By writing these operations as R code, we automatically create a log of everything we do to our data. By the end of this chapter, you will understand how to filter and subset data based on their values, and how the code you write serves as a record of your analysis.
We often have an idea of the data we'd like to access. Perhaps we are interested in knowing what data we have for a specific genus, Oecophylla.
oeco <- ant_data %>% filter(genus == "Oecophylla")
The filtered output contains a subset of the information in ant_data, but it is its own object, which we have assigned to a new variable, oeco. Now we will call the typeof function on each variable.
typeof(ant_data)
typeof(oeco)
typeof is a function that is built into R. It tells us what sort of object a variable is at the lowest level, which in turn tells us about its properties. Both calls return "list": under the hood, a dataframe is a list whose elements are its columns.
A list is a basic R object type: a one-dimensional collection whose elements can be of different types (unlike a matrix, where everything must match). In some ways, if a dataframe is like a spreadsheet, each element of the underlying list is like a single column in that sheet. More information on the list data type can be found by typing
?list
in your notebook.
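To make the connection between dataframes and lists concrete, here is a small sketch of things you could try in your notebook; the specimen column name comes from the Ants.csv file described above:
```
typeof(ant_data)          # "list": a dataframe is a list of columns
names(ant_data)           # the element names are the column names
ant_data[["specimen"]]    # extracting one element gives a plain vector
```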
By chaining filters like this, we can begin to make multiple selections of data. For example, this syntax will select all the examples of Formica that are more than 30 million years old.
ant_data %>% filter(genus == "Formica") %>% filter(min_ma > 30)
Challenge
How could you print the minimum ages for all three Mianeuretus listed in this data file?
The above is more useful and flexible than what we discussed in Chapter Two. But it's still somewhat naive: we need to know exactly which values we are looking for in order to retrieve them. We'll now cover a few methods that allow us to sort and access data without knowing this a priori.
Something we might want to do is know which of our ants are at least as old as the KPg event, the end of the Cretaceous. Ants are thought to have arisen during the Cretaceous. We can do this easily and assign our sample of old ants to a new dataframe:
old_ants <- ant_data %>% select(specimen, min_ma, max_ma) %>% filter(min_ma > 65)
The way this works is that we first select the specimen, min_ma and max_ma columns, then keep only the rows where min_ma is larger than 65. Filtering keeps whole rows. We then assign the result to a new dataframe. We can write this dataframe to a csv file for storage like so:
write_csv(old_ants, "ants_old.csv")
Challenge
- Try this again - how could we filter for ants that only existed after the end of the Cretaceous? Can we use the same column?
- When you place your cursor inside a function's parentheses and press Tab, you can see the arguments, or special options, available to that function. Try a couple, such as the arguments that control the delimiter. Once you've chosen an argument, Shift+Tab will show you its documentation.
We can also use this type of indexing to compare columns. Our max_ma column should always be bigger than min_ma. We can check for rows where this isn't the case - these are errors.
ant_data %>% filter(min_ma > max_ma)
In our data, we don't have any such errors, but if we did, we could drop them like so:
ant_data <- ant_data %>% filter(min_ma > max_ma)
Does anyone see a problem with what I did there?
I made a mistake: I selected only the data that have a minimum age larger than the maximum age, and used the result to overwrite the dataframe object! If I had overwritten my data for real, this would be a big problem. But I didn't - R is not doing these operations in the actual spreadsheet. It's doing them on a dataframe held in memory - our real data are perfectly safe in the Ants.csv file.
In the previous command, I used the wrong inequality and accidentally dropped all the values I actually wanted to keep! So far, we've covered some important conceptual data lessons - that data are read-only, and that data should be stored in flat files. We will now learn an important lesson as it pertains to code - that code can be saved for later as an exact record of what we did in a session of programming.
This might not sound that important, but most of us have probably been working in a spreadsheet and completely forgotten how we got the useful annotations on a plot, or calculated a summary statistic. In this section, we will cover some best practices for keeping track of what we've done. We will expand on these best practices, and in more detail, later.
For now, let's have a look at the tool we've been using, the Jupyter Notebook. We've been typing into a window in our browser. That window is running R on our computer and rendering the output in a nice format in the browser. We have been pasting code into code cells and running them. In these notebooks, you can create a new cell and change its type by selecting something other than 'Code' from the dropdown. For example, we could pick 'Markdown' to write notes.
What we are doing when we do this is weaving together code and data to make readable documents that follow our whole workflow. When we are done for a session, we can select 'Save and Checkpoint' from the File menu. This will save our notebook for later. A notebook can be emailed to a collaborator, or, as we will cover later, managed under version control. We can also download the file as a plain R script in the File menu. We will discuss why we might want to do this in Chapter Four.
Together, what this means is that the code to reproduce our analysis is always available. If we need to go back and do it again, for example when we get more data, that's easy: we just run the code again. If a colleague wants to try the analysis, we can simply send them the code. This is much easier than trying to explain to someone where to click to get a certain option, particularly when spreadsheet software is not consistent across platforms.
So what to do with the mistake I made? We can simply go back to our notebook, re-run the steps before the mistake, and fix the inequality in the expression.
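A corrected version of the earlier command might look like the sketch below. The variable name is just a suggestion; the point is to keep the rows whose minimum age does not exceed their maximum age, without overwriting ant_data:
```
# Keep only internally consistent rows; assign to a new variable
# so the original ant_data object is left untouched
ant_data_checked <- ant_data %>% filter(min_ma <= max_ma)
```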
In this chapter, we have discussed how to programmatically access and read data in R. We have used a variety of commands to perform simple data accession, as well as more complex selections involving multiple columns and rows. We even made a mistake in the data that we accessed - and, in the process, learned that we can use the programmatic record to reproduce the steps before the mistake and get our data back.
We've now learned a bit about how to read in and access data using R. This chapter will guide you through some common operations that researchers often want to perform on their data using R. We will then take a step back and discuss how to use common programming constructs, like loops and lists, to make automating boring tasks easier.
After finishing this chapter, you will understand how to group, summarize, and combine dataframes.
One common task many researchers perform with their data is breaking it into smaller chunks, perhaps by treatment, to look at effects within each group. We're not working with experimental data here, but we can certainly split our data up. Ants make up the family Formicidae, which contains many subfamilies. We might be interested in seeing whether we have equal samples of each subfamily.
We can group our data by subfamily like so:
subfamilies <- ant_data %>% group_by(subfamily)
Doing this creates a grouped dataframe: the data look the same, but R now knows which rows belong to which subfamily, and subsequent operations are applied within each group, as in the sketch below.
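As a quick illustration of what grouping buys us, here is a sketch of a grouped summary. The summary statistic is just an example, but the subfamily and min_ma columns are the ones we have already been using:
```
# Mean minimum fossil age within each subfamily
ant_data %>%
  group_by(subfamily) %>%
  summarise(mean_min_ma = mean(min_ma))
```
To answer our sampling question, though, all we need is a count of specimens per subfamily, which count does in a single step: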
subfamilies <- ant_data %>% count(subfamily)
subfamilies
They are not. The samples of Formicinae and Dolichoderinae, for example, are much bigger than the rest. These are also among the most speciose groups of ants. Could the disparity in the sample be due to species richness? I've made a second spreadsheet with a rough guess at the species richness of each of the ant groups in this data set. Let's load it:
sub_r <- read_csv('data/subfamilyRichness.csv')
Now we have two dataframes: one with our observed sample counts and one with the number of species that exist in nature. To do mathematical operations using both pieces of data, we need to combine them. We will do this using a merge (also called a join), which combines two datasets on a shared column.
In our case, we will merge subfamilies with sub_r. The shared subfamily column will be used to match rows between the two dataframes: by default, R's merge function joins on the columns the two dataframes have in common, and you can name the join column explicitly with the by argument. The dataframe passed first is treated as the left table and the second as the right; here subfamilies is the left and sub_r is the right.
merged <- merge(subfamilies, sub_r)
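If you prefer to be explicit about the join column, you can name it. This is equivalent here, assuming both dataframes contain a column called subfamily:
```
# Explicitly join on the shared subfamily column
merged <- merge(subfamilies, sub_r, by = "subfamily")
```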
The output is a new dataframe containing the shared subfamily column, the sample count (n) from subfamilies, and the species richness (size) from sub_r. We can now divide one column by the other to see whether we have sampled some clades more heavily simply because they are more speciose. If this were true, we would expect to see roughly the same proportion of sampling across subfamilies, though not the same number of samples. We will assign the sampling proportion to a new column called proportion.
merged %>% mutate(proportion = n/size)
Do these proportions look right? Your answer should be no. There might be some errors in this database.
Challenge
- Try some other mathematical operations. Subtract the columns!
- How could you delete a column if you decided you didn't want your mathematical output? (Hint: look at dplyr's select with a minus sign, or try setting the column to NULL.)
- Think a little about data as read-only: where will you want to save these outputs?
## Recap
In this chapter, we have looked at subsetting our data. We used the group_by and count commands to make data subsets along a biologically interesting axis. We then used loops and lists to process data and make managing datasets easier. Finally, we joined together multiple data objects and used them to perform mathematical operations. In the next chapter, we will build on the concepts seen here to further automate data management.
So far, we've covered the nuts and bolts of programming. We've learned a little bit about how to control R, and it's time to think about controlling its environment. As you program more and more, you'll need to have maintainable pieces of code, and some framework to maintain them.
In this chapter, we'll cover two core concepts. First, we will discuss scripts, which may contain one or more functions along with the comments needed to interpret and use them. Second, we will discuss revision management, the practice of using a defined system to track how and when changes are made to your code.
By the end of this chapter, you will be able to save your analysis as a script and track changes to it using git.
Now we have an actual, working script. This is fantastic - we can send this to a coworker and have them reproduce our analyses. We can use it again later to recall exactly what we did!
Until our coworker emails it back and says it doesn't work, and we open it to find a bunch of changes. Until we accidentally delete a piece of code, close our laptop, and go on vacation. Until our laptop dies.
Enter revision management, for tracking changes to our code (or any other text file - this tutorial was written with revision management). As you're aware, we will be using git for revision management this semester. What we're going to do now is a really swift primer on some of the basic functionality of git and its web interface, GitHub.
First, we'll do some brief setup:
```
git config --global user.name "Your Name"
git config --global user.email "you@youremail"
```
When you do the setup, use the email address you registered with GitHub. That way, GitHub will know the commits are from you and won't reject them (I have added you as collaborators on my project). I really want you to protect the script we just wrote together. First, add your initials to the name of your script so I know whose is whose. Mine would be first_script_amw.R. Now, in the Spring2017 directory, type:
git add first_script_yourinitials.R
This command lets git know that we're interested in tracking any changes made to this script. Next, we want to commit the script. Committing takes a snapshot of the script, preserving it as it existed in that moment in time. People differ on how often to commit. Some people commit frequently, some people only commit when they are done for the day.
git commit
This will now ask you for a short message describing what is being committed. You could say something like 'adding ant data parsing script' or 'initial draft of script'. Try for something informative. Now, we push to the internet:
```
git push
```
This may ask you for your username and password, if you cloned the repository over HTTPS. Once this has completed, you should be able to go to the website for the repository and look at your script!
You might be wondering now where git is storing your log of revisions. If you go up one level like so:
```
cd ..
```
and list hidden files:
```
ls -a
```
you will see a folder called .git. This stores all your checkpoints and snapshots.
We'll cover more complex Git functionality in the future as we do more work.
Challenge
Try to find a typo to fix in the lessons. Fix it, add the file in git, commit it, and push it.