knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
#library(lterdatasets) library(vcrfish) library(tidyverse)
An initial cleaning of the VCR fish dataset:
vcr_fish <- vcr_fish %>% mutate(sample_date=as.Date(sample_date, "%Y-%m-%d")) %>% mutate(sample_year=as.integer(format(sample_date, format = "%Y")), sample_month=as.integer(format(sample_date, format = "%m")), sample_day=as.integer(format(sample_date, format = "%d")) ) vcr_fish <- vcr_fish %>% mutate(sample_hour=as.integer( format(as.POSIXct(sample_time, format = "%H:%M"), format = "%H") ), sample_min=as.integer( format(as.POSIXct(sample_time, format = "%H:%M"), format = "%M") ) )
Let's take a look at the variables within this dataset:
vcr_fish %>% head() sapply(vcr_fish, typeof)
Let's take a closer look at differences between fish species to learn more.
vcr_fish %>% distinct(species_name)
Before continuing to any exploratory analyses, it is always important to check the sample size of the variables in question.
# Count the number of names names_count <- table(vcr_fish %>% select(species_name)) names_count
There is a large disparity in the recorded fish species above--there are 442 rows of data collected on Pipefish, whereas there are only 2 rows collected on the Pufferfish. In the code below, I am filtering the fish species type to have greater than 200 samples to ensure that there is a sufficient sample size in our analyses later on.
# Subset the orginial dataset with only the species names specified by the line above. vcr_fish2 <- subset(vcr_fish, species_name %in% names(names_count[names_count > 200]))
Now, we have 5 different fish species (pipefish, silversides, pinfish, silver perch, and bay anchovy) with more than 200 rows of data.
vcr_fish2 %>% distinct(species_name)
A five number summary is a statistical summary that includes the minimum, maximum, median, mean, first quartile, and third quartile values of a given dataset. Let's calculate the five number summary for the fish length of all the species sampled.
vcr_five_number_sum <- summary(vcr_fish2 %>% select(length)) vcr_five_number_sum
Upon a closer look, we see that our IQR = 95 - 50 (Q3 - Q1) = 45 inches, where 50% of the data sits inside these two quartiles and 50% sits outside of these quartiles. This is a large disparity, so lets create five number summaries for each fish species individually.
Let's view the five number summary for all 5 fish species.
# Create separate datasets for all 5 species pipefish <- vcr_fish2 %>% filter(species_name=="Pipefish") silversides <- vcr_fish2 %>% filter(species_name=="Silversides") pinfish <- vcr_fish2 %>% filter(species_name=="Pinfish") silverperch <- vcr_fish2 %>% filter(species_name=="Silver Perch") bayanchovy <- vcr_fish2 %>% filter(species_name=="Bay Anchovy") # Individual five number summaries pipefish_sum <- summary(pipefish %>% select(length)) silversides_sum <- summary(silversides %>% select(length)) pinfish_sum <- summary(pinfish %>% select(length)) silverperch_sum <- summary(silverperch %>% select(length)) bayanchovy_sum <- summary(bayanchovy %>% select(length)) pipefish_sum silversides_sum pinfish_sum silverperch_sum bayanchovy_sum
Boxplots are a graphical and visual way to display numerical data through four quartiles. This visualization is advantageous because it easily shows a given variable's variability, skewness, outliers, and symmetry. Additionally, boxplots help to visualize the five number summary that we calculated above.
Let's graph a boxplot of the filtered fish data to look at initial trends in the data.
vcr_fish2 %>% ggplot(aes(x=species_name, y=length)) + geom_boxplot()
It appears that the pinfish species has an outstanding outlier. Let's go ahead and remove this outlier just so that we can view the boxplots at a closer level.
pinfish_outlier <- vcr_fish2 %>% slice_max(length) #store it in case vcr_fish3 <- vcr_fish2 %>% filter(length!=pinfish_outlier$length)
Let's add color to our boxplots to differentiate fish species from each other.
vcr_fish3 %>% ggplot(aes(x=species_name, y=length, color=species_name)) + geom_boxplot()
Additionally, we can add a main title and alter the axis' titles.
ggplot(data = vcr_fish3, aes(x=species_name, y=length, color=species_name)) + geom_boxplot() + labs(title = "Boxplots of VCR Fish Species", x = "Species Name", y = "Length") + theme(plot.title = element_text(hjust = 0.5))
Let's reorder the boxplots in an ascending order.
vcr_fish3 %>% ggplot(aes(x = fct_reorder(species_name, length), y = length, color = species_name)) + geom_boxplot() + labs(title = "Boxplots of VCR Fish Species", x = "Species Name", y = "Length") + theme(plot.title = element_text(hjust = 0.5))
Overall, we see that pipefish as a whole species are longer than the other fish species given they have the highest median value around 140 and have the greatest range compared to the other fish plotted above. Reversely, bay anchovies have the smallest range out of all the fish, while pinfish have the lowest median length value of 49. We should also reiterate that the pinfish species had the largest outstanding outlier that was removed from the boxplots.
McGlathery, K. 2019. Fish Counts and Lengths in South Bay and Hog Island Bay, Virginia 2012-2018 ver 11. Environmental Data Initiative. https://doi.org/10.6073/pasta/af32b580d4646f4b35d7552d83e9b09b (Accessed 2021-03-18).
r knitr::spin_child(here::here("data-raw","vcr_fish_data.R"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.