In mzinkgraf/BiometricsWWU: BioStats at WWU

Basic Probability

We will go over a few examples, to illustrate how you can use R to help you with probabilities.
In this lab, I want you to work from a script file (if you don't know what I mean, then please ask!).

`prob` Package Info

In R, there is a package called prob that we can use to help us answer questions about probability. So, let's install and load the prob package.

You can find information about installing and loading packages in the Intro to R: Packages link.

require("prob")

Athough there are a bunch of functions in the package, we will focus on a few.

iidspace(): Creates a sample space with the associated probabilities.
subset(): Creates a new set from another set.
union(): Creates a new set that is the union of two sets.
intersect(): Creates a new set that is the intersection of two sets.
setdiff(): Creates a new set that includes everything in the first set but not the second set.
Prob(): Calculates the probability for a set or a conditional probability of one set given another set.

You can find more information about the prob package at the following link web page.

Crocodile Eggs

Recall from lecture the questions about the probability of crocodile eggs hatching. Below we will work through a few examples that are in the lecture.

To quickly create a sample space and associated probablities, we can use the function iidspace(). We can include the outcomes, the number of trials, and the probabilities of each outcome.

Let's create the sample space for the situtation in which we have 3 eggs. Assume that each egg can only hatch or not hatch, and the probability of hatching is 0.8 and the probability of not hatching is 0.2. You are welcome, and encouraged, to modify this example.

egg.ss <- iidspace(c("Hatched", "Not Hatched"), ntrials = 3, probs = c(0.8, 0.2))
egg.ss

Wow, that was easy!!!! Of course you need to understand how to calculate each probability by hand. So, now calculate by hand each of these proabilities. If you don't recall how, then look at your notes or ask me for help.

You can now subset the sample space (i.e., create new sets).

The first new set, which I will call f, is all the outcomes in which the first egg hatched. So, that is rows 1, 3, 5, 7. The second set, which I will call t, is all the outcomes in which two or more egg hatched. So, that is rows 1, 2, 3, 5. Let's create those subsets now.

f <- egg.ss[c(1,3,5,7), ] #Alternatively you could use the function subset, see below.
f

t <- egg.ss[c(1,2,3,5), ] #Alternatively you could use the function subset, see below.

For the rest of this example, we will use the two sets we just created. However, below are some examples of selecting sets using the function subset(). In particular, we are going to use "or" and "and" statements to create new sets. In R, you indicate "or" with a vertical bar | and "and" with an ampersand &.

#Select all the cases in which the first egg hatched
#Same as the set f we created above
subset(egg.ss, X1 == "Hatched")

#Select all the cases in which two or more eggs hatched
#Same as the set t we created above
subset(egg.ss, 
            X1 == "Hatched" & X2 == "Hatched" | 
            X2 == "Hatched" & X3 == "Hatched" |
            X1 == "Hatched" & X3 == "Hatched")

#Select all the cases in which the second and third eggs did not hatch
subset(egg.ss,X2 == "Not Hatched" & X3 == "Not Hatched")

#Select all the cases in which at least one egg hached
subset(egg.ss, X1 == "Hatched" | X2 == "Hatched" | X3 == "Hatched")

Here is a more advanced example to illustrate how you can create more flexible ways to create sets. You are welcome to ignore the select_set example, but you are doing well if you understand it. We first create a function with the arguments set (the set you want to selection from), selection_number (the number of successful trials that you want to select), selection_name (the name of a success in the set), and , n_trials (the number of trials in the set). I have included default values so the function will return 2 or more eggs that hatched. The function uses a for loop to repeat an operation, in this case determine whether there are two or more eggs that hatched in each row. It then returns all the rows from the set the meet the our criteria. Please ask if you have questions.

#Our homemade function to create a subset
select_set <- function(set, selection_number = 2, selection_name = "Hatched", n_trials = 3) {
  in.t <- c()
  for (i in 1:dim(set)[1]) {
    in.t[i] <- sum(set[i, 1:n_trials] == selection_name) >= selection_number
  }
  return(subset(set, in.t))
}

#Select all the cases in which two or more eggs hatched
#Same as the set t we created above
select_set(egg.ss)

#Select all the cases in which 1 or more eggs did not hatched
select_set(egg.ss, 1, "Not Hatched")

We can now determine the union and intersection of two sets, f and t. Remember to think about the answer before you run each function. You should not ignore this.

u <- union(f, t)
u
i <- intersect(f, t)
i

When you load the prob package, it loads several functions that mask existing functions in R. Two of these are union() and intersect(). Beware that the union() and intersect() functions that load when you start R are not the same as the union() and intersect() that load when you load the prob package.

There is also a function setdiff() which gives the outcomes in the first set but not the second. Notice that which set you put first matters.

setdiff(f, t)
setdiff(t, f)
not.t <- setdiff(egg.ss, t) #Gives t^c
not.f <- setdiff(egg.ss, f) #Gives f^c

You can use the function Prob() (notice the P is capitalized) to calculate the probabilities of these sets.

Prob(f)
Prob(t)
Prob(u)
Prob(i)
Prob(not.t)
Prob(not.f)

The function Prob() can also calculate conditional probabilities.

Prob(i)/Prob(f) #Formula for conditional probs
Prob(t, given = f)
Prob(f, given = t)

Genetics Example

Now let's do a question that Dr. Trent might ask you in Genetics. There is a population of rabbits with three alleles for a particular gene. Let's call the alleles a, b, and c. In the population there are 560 copies of a, 105 copies of b, and 35 copies of c.

From these frequencies, calculate the allelic proportions for each allele (you learned this in Biol 204).

Now assume Hardy-Weinberg equilibrum and answer the following questions.

What is the expected proportions of each genotype.
Assuming that a is dominant over the other two alleles, what is the probability that an individual randomly pulled from the population will have an "a" phenotype?

Try to graph the data with a bargraph or pie chart.

Ladybird Beetle Example

In this example, we are going to import data from a csv file. You can find the file, named "ladybirddata.csv", on Canvas. Save it, and remember the location. Now import the data into R. This is a standard format for representing categorical data. Each row represents an observation.

#set your working directory
#use dir() to see files in the working directory

data <- read.csv("../inst/extdata/ladybirddata.csv")
str(data)

We can use the functon table() to quickly calculate the frequencies. Let's calculate how many individuals had spots for each species.

table(data$Species, data$Spots)

We can use the with() function if we don't want to type the name of the data.frame multiple times. The first argument is a data.frame, and the second is the code that want to execute with that data.frame. Notice I only have to type Species and not data$Species. The following code produces the same result as above.

with(data, table(Species, Spots))

How would you determine how many individuals have a given background color for each species?

We can also ask R to give us more specific information. For example, we can use the code below to determine the number of individuals with a given color of spots and a given background for each species.

with(data, table(BG_Color, Spot_Color, Species))

Change the order of the column names, and see what happens.

We can use the function prop.table() to calculate the proportions, which are also the probilities, from the frequencies.

prop.table(with(data, table(Species, Spots)))

Practice calculating the frequencies and proportions for other combinations of characters, like background color and spot color. You should also practice making barplots and pie charts to represent these data.

So far, all the functions we have used in the Ladybird Beetle example are loaded when you start R. In other words, they are not in a special package. There is also another useful function empirical() in the prob package that will calculate the proportions for each of the combination of characters in the data.

empirical(data)

Playing Around

Here are some examples of functions in the prob package that you might find helpful for practicing R and learning probability.

There are several functions that will quickly generate some common sample spaces.

tosscoin: Creates a sample space from flipping a coin.
rolldie: Creates a sample space from rolling dice.
cards: Creates a sample space representing a deck of cards.
roulette: Creates a sample space representing a roulette table.
urnsamples: Creates a sample space from pulling items from a jar.

Let's create a few sample spaces now.

Create the sample space for flipping a coin 6 times. For all the examples below, I use the head function so I don't print out the entire sample space (which are large in a few examples). These functions have an argument makespace that will also add the probabilities of each outcome.

coin.ss <- tosscoin(6)
head(coin.ss)
coin.ssp <- tosscoin(6, makespace = T)
head(coin.ssp)

Create the sample space for rolling 2 dice. This sample space is used to model the game craps.

dice.ss <- rolldie(2)
head(dice.ss)
dice.ssp <- rolldie(2, makespace = T)
head(dice.ssp)

Create the sample space for a deck of card that includes the jokers.

cardDeck <- cards(jokers = T)
head(cardDeck)
cardDeckp <- cards(jokers = T, makespace = T)
head(cardDeckp)

Create the sample space for a roulette table.

roulTable <- roulette()
head(roulTable)
roulTablep <- roulette(makespace = T)
head(roulTablep)

We can use the function urnsamples() to create additional sample spaces.

Create the sample space for pulling M&Ms from a jar.

mms <- urnsamples(c("red", "orange", "yellow", "brown", "light brown", "blue", "green"), 2, replace = T, order = T)
head(mms)

You can also create the sample space in which the order does not matter.

mms2 <- urnsamples(c("red", "orange", "yellow", "brown", "light brown", "blue", "green"), 2, replace = T)
head(mms2)

You can create the hands in a poker game with 2 card hands. You might notice that there are 1,326 different hands with only two cards! A 5 card hand will take a long time because there are so many different possible hands.

pokerHand <- urnsamples(cards(), 2)
head(pokerHand)

Now let's learn why it is a bad idea to play roulette. Roulette tables pay 1 to 1 odds, which means that if you bet 5 dollars on red and red hits, then the dealer gives you your 5 dollars back plus another 5 dollars. The probability of winning with 1 to 1 odds is 0.5, or 50% of the time. So, if you won 50% of the time, then this game would be a fair one. Pretty sweet, right? No, and let's see why.

We created our sample space of the roulette table above. So, let's create the set of hitting red.

red <- subset(roulTablep, color == "Red")
Prob(red) #Prob of winning
1 - Prob(red) #Prob of losing

Notice that the probablity of winning is actually less than 0.5, which means that on average you will lose more than you win--yet they pay you as if your chances were 50%.

Now let's become a serious gambler. We are going to spend all day at the roulette table. Fortunatley, we can simulate this without spending a dime, which is good thing because we are likely to lose! Let's play the game 2,000 times. You will get a different set of results than what is below because each spin of the table is random.

playingRoulette <- sim(roulTablep, ntrials = 2000)
head(playingRoulette)

Now let's figure out how we did. Let's assume we played 5 dollars on red each time.

wagered <- 5*2000
#nrow() returns the number of rows
earnings <- 10 * nrow(subset(playingRoulette, color == "Red")) #10 because $5 for bet and $5 for winnings
(winnings <- earnings - wagered)