knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(MATH4753cast0059)
library(dplyr)
The package includes a variety of statistical functions that were written and used in MATH 4753, Applied Statistical Methods. These include functions for analyzing probability distributions, a function for easily importing data, functions for sampling populations, bootstrapping, and more.
The myread() function reads .csv data into the environment. It makes it easier to draw data from a folder that is not in the local directory of the project or script.
This is most useful when there are multiple files you wish to draw from in a data source outside the local directory, avoiding the unnecessary hassle and wasted storage that copying them over would entail.
It returns the data in the form of a table.
To demonstrate, I will create a temporary file and read it into the global environment. This is done by passing an empty string for the directory and the output of tempfile() as the name of the .csv.
data(firedam)
test1 <- as.character(firedam$DISTANCE)
tf <- tempfile()
writeLines(test1, tf)
dat <- MATH4753cast0059::myread('', tf)
dat
summary(dat)
As you can see, the data is read into R as a data frame and can be manipulated like any other.
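As a quick illustration, a couple of base R operations work on the imported object exactly as they would on any other data frame (a minimal sketch; the column names of dat depend on the file that was read, so only positional access is shown):

str(dat)   # structure of the imported data frame
nrow(dat)  # number of rows read in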
mybin() simulates a binomial experiment: in each iteration it draws n trials with a given success probability and records the number of successes. The function counts how many iterations produced each number of successes, divides by the total number of iterations, and produces a plot and table giving the relative frequency of each result.
This could be useful in applications where you would like to visualize the distribution of some kind of binomially distributed binary outcome, such as a series of coin flips.
Let's assume that we have a large population of binomial results, half TRUE and half FALSE, much like a series of coin flips. We would like to draw 10 samples, or 10 coin flips, from this population and record how many were TRUE, then repeat this 100 times and combine the results to obtain our final distribution. We would apply mybin() in the following way:
MATH4753cast0059::mybin(iter = 100, n = 10, p = 0.5)
mybin() tells us that this binomial distribution could be approximated by a normal distribution centered about a mean of 5 successes. If we were to increase the number of iterations this would become even more apparent, like so:
MATH4753cast0059::mybin(iter = 10000, n = 10, p = 0.5)
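For comparison, the exact binomial probabilities can be computed with base R's dbinom(); the simulated relative frequencies from mybin() should settle toward these values as iter grows (a quick check, not part of the package):

# Exact P(X = k) for k = 0, ..., 10 successes out of 10 trials with p = 0.5
round(dbinom(0:10, size = 10, prob = 0.5), 3)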
mymult() assumes a multinomial distribution and finds the most likely number of successes for each option given an array of probabilities, one per option. The function counts the number of iterations that produced each number of successes, divides by the overall number of iterations, and represents the distribution as percentages in a relative frequency format.
This could be used in applications where there is a categorical variable with multiple options that have differing probabilities, such as responses to a multiple choice survey question or respondents' region of residence in the United States.
To demonstrate this function I will assume a categorical variable with four possible values, each with its own probability, much like responses to a multiple choice question. In this example we will examine the likelihood of collecting a given fish species in a river from the ddt dataset and visualize the distribution using mymult(). We would apply the function in the following way:
data(ddt)
table(ddt$SPECIES, ddt$RIVER)
ddt %>%
  filter(RIVER == 'TRM') %>%
  group_by(SPECIES) %>%
  tally() -> p
table(p)
p$SPECIES
p$n/sum(p$n)
MATH4753cast0059::mymult(iter = 100, n = 10, p = p$n/sum(p$n))
This shows our multinomial distribution and allows us to make some conjectures about a larger data set than ddt. We get a good visual of some of the possible results given the number of fish of each species captured.
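If it helps to see what a single iteration looks like, base R's rmultinom() draws one multinomial sample of 10 fish using the same probabilities (a small illustrative sketch using the p computed above):

set.seed(1)  # for a reproducible draw
rmultinom(1, size = 10, prob = p$n/sum(p$n))  # counts per category, in the order of p$SPECIES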
myhyper() shows the hypergeometric distribution of a sample: the probability of each number of successes when sampling without replacement.
The hypergeometric distribution models binomial-style sampling without replacement. This matters when the sample size is large relative to the population, since sampling without replacement changes the probabilities from draw to draw, or when there is a relatively small number of TRUE values, corresponding to a low probability of success. It is useful in applications like modeling the probability of finding a defective part on an assembly line.
Suppose we have observed two faulty parts, due to faulty machinery, in an assembly line run of 100. We would like to model the probability distribution for faulty parts on our line over 100 simulated runs to determine how many parts we should sample to detect at least one. We will sample 5 parts from each simulated run. We would apply myhyper() in the following way:
MATH4753cast0059::myhyper(iter = 100, N = 100, r = 2, n = 5)
myhyper() indicates that sampling 5 parts from the line is likely not enough to detect the faulty machinery: assuming two defective parts in a run, we will catch at least one of them only a small fraction of the time. Let's see what a greater sample size would look like.
MATH4753cast0059::myhyper(iter = 100, N = 100, r = 2, n = 25)
If we sample a quarter of the parts, then, assuming two faulty parts in a run, we will catch at least one roughly 45% of the time. This is much better and allows faults in equipment to be caught relatively quickly and prior runs to be examined.
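These simulated percentages can be checked against the exact hypergeometric probabilities with base R's dhyper(), where m is the number of defective parts, n the number of good parts, and k the sample size (a quick verification, not part of the package):

# P(at least one defective) when sampling 5 or 25 parts from a run of 100 with 2 defectives
1 - dhyper(0, m = 2, n = 98, k = 5)   # about 0.10
1 - dhyper(0, m = 2, n = 98, k = 25)  # about 0.44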
mysample() draws n samples, with replacement, from a population of N indices in each iteration and shows the distribution of each sample.
This could be useful as a way to sample data by index and visualize which indices were selected. It should follow a roughly uniform distribution as all probabilities are equal.
Suppose we would like to randomly select 15 indices, with replacement, from the spruce dataset for further analysis, repeating the draw over 5 iterations. We could apply our function in the following way:
data(spruce)
sam <- MATH4753cast0059::mysample(n = 15, N = length(spruce$BHDiameter), iter = 5)
sam
We could then extract the data at those indices to get our desired random sample.
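For example, the idea can be sketched with base R's sample(): draw row indices with replacement and subset the data frame with them (the exact shape of the object returned by mysample() may differ, so this is only an illustration):

set.seed(2)
idx <- sample(seq_len(nrow(spruce)), size = 15, replace = TRUE)  # 15 row indices drawn with replacement
head(spruce[idx, ])  # the corresponding rows of the spruce data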
myncurve() uses a given mean and standard deviation to draw a normal distribution, calculates the probability of a result below the value a, shades the corresponding region in the plot, and prints the associated probability to the console.
This function could be useful for visualizing a normal distribution with known mean and standard deviation and seeing the probability of a value falling in a given region. An example could use student grade data to determine the percentage of students who scored lower than a C.
Following the use case above, let's assume that a class of students had an average of 80% on an exam with a standard deviation of 7%. We would apply myncurve() in the following way:
myncurve(mu = 80, sigma = 7, a = 70)
myncurve() shows a plot visualizing the distribution and reports the probability we were looking for: about 8% of students in the class scored lower than a C on the exam.
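This matches the closed-form value from base R's pnorm() (a quick check, not part of the package):

pnorm(70, mean = 80, sd = 7)  # P(score < 70), about 0.077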
myddt() subsets the ddt data frame by a species given as a parameter. It returns a corresponding plot, the data frame before and after subsetting, and a relative frequency table of rivers.
This is mostly applicable to the ddt data frame, which records fish length, weight, DDT concentration, and more.
data(ddt)
out <- myddt(ddt, 'CCATFISH')
out$plot
out$`relative frequency of river table`
head(out$dataframe)
head(out$`subsetted dataframe`)
It gives us a nice plot of length and weight of the selected species in each river, a relative frequency table for each river, and the subsetted dataframe.
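For context, the pieces of this output correspond roughly to operations we could perform directly with dplyr and base R; this is only an illustrative sketch, and myddt() may compute them differently internally (for example, the relative frequency table might be based on the subset rather than the full data):

head(dplyr::filter(ddt, SPECIES == 'CCATFISH'))  # a subset like out$`subsetted dataframe`
table(ddt$RIVER)/length(ddt$RIVER)               # relative frequencies of rivers in the full data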
mycltp() applies the central limit theorem to the Poisson distribution and displays a histogram of the density of the sample means. For a given lambda and sample size, it records the mean of each iteration and also gives a relative frequency plot of the individual results.
The Poisson distribution can help us predict the occurrence of rare, unrelated events in a unit of time, area, or volume. By examining the mean we can figure out the number of events to expect. This could be applied to something like the number of calls received in an hour at a call center.
If we take samples of 5 hourly call counts from a call center that receives about 10 calls per hour, and would like to see how the sample mean behaves, we might use the mycltp() function like so:
par(mar = c(3, 3, 3, 0))
MATH4753cast0059::mycltp(n = 5, iter = 100, lambda = 10)
These plots show us that 10 is the most likely number of occurrences but that the results have a fairly significant range.
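The same experiment can be sketched directly with base R to see why the spread looks the way it does: by the central limit theorem the sample means should be approximately normal with mean lambda = 10 and standard deviation sqrt(lambda/n) = sqrt(10/5), about 1.41 (a quick check, not part of the package):

set.seed(3)
means <- replicate(100, mean(rpois(5, lambda = 10)))  # 100 sample means of 5 Poisson(10) draws each
c(mean(means), sd(means))  # should be near 10 and sqrt(10/5) ≈ 1.41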
myboot2() applies a function to data through the method of bootstrapping: it resamples the data with replacement to create iter new samples, calculates the function's result on each, and reports the statistic along with a confidence interval.
This could be used for almost any sample of real data which one wishes to extrapolate to a larger data set.
If we wish to find the mean length of catfish from the TRM river in the ddt dataset, we would apply myboot2() to ddt like so:
data(ddt)
ddt %>%
  filter(RIVER == 'TRM', SPECIES == 'CCATFISH') %>%
  select(LENGTH) -> sam
MATH4753cast0059::myboot2(iter = 10000, sam$LENGTH, fun = "mean", alpha = 0.05, cx = 1.5)
We can see that the average length is about 44.68 with a fairly narrow confidence interval, so if this sample is representative of the population we can be fairly confident in this estimate of the mean.
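The same idea can be sketched by hand with base R, resampling the lengths with replacement and taking quantiles of the bootstrap means; myboot2() wraps this kind of calculation along with the plot (an illustrative sketch, not the package's exact implementation):

set.seed(4)
boot.means <- replicate(10000, mean(sample(sam$LENGTH, replace = TRUE)))  # bootstrap distribution of the mean
quantile(boot.means, c(0.025, 0.975))  # approximate 95% confidence interval for the mean length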
mymlnorm() computes the maximum likelihood estimates of the mean and standard deviation of a sample of data, most likely a column of a data frame. It is most useful when the data is approximately normal in its distribution.
This could help determine the mean and standard deviation of data with a constant expected value and variance in a number of situations, for example the level of carbon dioxide in the air at a specific time or the breast height diameter of trees.
If we want to determine and visualize the mean and standard deviation of data, in this case breast height diameter of spruce trees, along with confidence intervals then we might apply mymlnorm() in the following way:
data(spruce)
MATH4753cast0059::mymlnorm(spruce$BHDiameter, mu = seq(16, 21, length = 100),
                           sig = seq(3.5, 6.5, length = 100), lwd = 2, labcex = 1)
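For a normal sample the maximum likelihood estimates also have a closed form, the sample mean and the standard deviation computed with a divisor of n rather than n - 1, so the grid search above can be checked against them (a quick verification using base R):

x <- spruce$BHDiameter
c(mu.hat = mean(x), sigma.hat = sqrt(mean((x - mean(x))^2)))  # closed-form normal MLEs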
ddt is a dataset containing the DDT concentration, length, weight, species, river of capture, and mile measurement of each fish surveyed. This data was collected to determine the prevalence of DDT in the fishes' environment.
RIVER: Name of the river the fish was caught in
MILE: Miles from the mouth of the river at which the fish was caught
LENGTH: Length of the fish
WEIGHT: Weight of the fish
SPECIES: Species of the fish caught
DDT: DDT concentration in the fish
data(ddt)
head(ddt)
spruce is a dataset containing the height and breast height diameter of trees, both numeric variables. This data was collected in the hopes that breast height diameter is a good way to find an estimate for spruce tree height.
Height: The height of the spruce tree
BHDiameter: The breast height diameter of the spruce tree
data(spruce)
head(spruce)
firedam is a dataset containing the distance and damage of fires, recorded to measure their severity and support inference about future fires.
DISTANCE: The distance traveled by the fire
DAMAGE: The damage caused by the fire
data(firedam)
head(firedam)