knitr::opts_chunk$set(echo = TRUE) fig_caption=TRUE library(psych) library(ggplot2) library(NWCEd) library(gridExtra) library(datasets)
Term | Definition ---------------- | --------------------------------------------------- Box plot | A graphical display of the distribution of data for a given dataset Outlier | A data point that is distinctly separate from the rest of the data Mean | The mean or average used to derive the central tendency of the data in question Median | The middle value in a list values sorted from smallest to largest Standard Deviation | Used as a measure of the variation in a distribution Spread | Refers to how stretched or squeezed the distribution is Kurtosis | A measure of the "tailedness" of the probability distribution Skewness | A measure of the asymmetry of the probability distribution Quartile | The median of either the upper or lower half of a dataset after the dataset has been ordered and and separated into two groups Stakeholder | A person or group that has an interest or concern
Huc Layer:
to 12 Digit. Note that both 12-Digit and 8-Digit HUC ID's can work. For this exercise we will use a 12-Digit HUC ID. In the search bar, type "Denver" and then select "Denver County CO Denver County". Zoom out until you are able to clearly view the delineated watersheds. Select the watershed which largly encompasses Denver, Colorado as shown in Figure 2.
With the HUC ID obtained, we are ready to download the data and create our box plots. For this lab, the box plots have already been produced using the datasets associated with HUC #101900030304. For reproduction of graphs, please click on the Show Code
button below each graph. Embedded code can be run in the RStudio console. Information on the functions used can be found at https://github.com/NWCEd/NWCEd/tree/master/R. Before we jump into our analysis of ET and precipitation datasets, let's review how to interpret box plots. Below is an example of a box plot.
ggplot(cars, aes("var", speed)) + geom_boxplot(fill = "#E69F00", color = "black", outlier.colour = "red", outlier.shape = 16, outlier.size = 2) + xlab("") + ylab("Units") + scale_x_discrete(breaks = NULL) + ggtitle("Example") + annotate("text", x=.8, y=20, label = "3rd Quartile") + annotate("text", x=.8, y=16, label = "Median") + annotate("text", x=.8, y=13, label = "1st Quartile") + annotate("text", x=1.15, y=25, label = "Largest non-outlier value") + annotate("text", x=1.15, y=4.3, label = "Smallest non-outlier value") + annotate("point", x = 1, y = 27, colour = "blue", size = 3) + annotate("point", x = 1, y = 2, colour = "blue", size = 3) + annotate("text", x=1.057, y=27, label = "Outlier") + annotate("text", x=1.057, y=2, label = "Outlier") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
This box plot has been generated using the dataset called cars which is found in the preloaded library in RStudio called datasets. The box plot is broken into 4 sections or quartiles. The first quartile range is between the "Smallest non-outlier value" and the bottom line of the box labeled "1st Quartile". Each quartile contains 1/4 of the data from the dataset. The second quartile range is between the 1st Quartile and the Median. The third quartile range is between the Median and "3rd Quartile". And lastly, the 4th quartile range is between the 3rd Quartile and "Largest non-outlier value". The Outliers which are indicated by blue circles have been added to the plot artificially and the respective values do not belong to the original dataset. Identifying outliers can help improve the quality of a given dataset. Now we are ready to look at the box plots for ET and precipitation for the HUC ID 101900030304.
The HUC ID is entered as an argument into the getNWCData function from the NWCEd package which brings in the hydrologic datasets associated with the specified watershed from the NWC-DP. The annualize function from the NWCEd package is then used to convert the dataset from daily to annual values. We can now use a plotting function in the ggplot2 library to create desired box plots for statistical analysis. The box plots of annual ET and annual precipitation for HUC ID 101900030304 are shown below with a printed summary of statistics performed for the respective plots.
getdata <- getNWCData(huc="101900030304") getetdata <- getdata[[1]] annualgetetdata <- annualize(getetdata, method = "sum") ggplot(annualgetetdata, aes("var", data)) + geom_boxplot(fill = "#009E73", color = "black", outlier.colour = "red", outlier.shape = 16, outlier.size = 2) + xlab("") + ylab("ET (mm per year)") + scale_x_discrete(breaks = NULL) + ggtitle("ET for HUC 101900030304")
describe(getetdata[[2]])
getdata <- getNWCData(huc="101900030304") getprcpdata <- getdata[[2]] annualgetprcpdata <- annualize(getprcpdata, method = "sum") ggplot(annualgetprcpdata, aes("var", data)) + geom_boxplot(fill = "#56B4E9", color = "black", outlier.colour = "red", outlier.shape = 19, outlier.size = 2) + xlab("") + ylab("Precipitation (mm per year)") + scale_x_discrete(breaks = NULL) + ggtitle("Precipitation for HUC 101900030304")
describe(annualgetprcpdata[[2]])
What are the 1st and 3rd quartiles? Are there any outliers? What is the scale on this plot? Is it different from the last plot? Because box plots are simplistic in nature, it is important to remember the little details such as the scale. For easier comparison, let's look at the ET box plot and the precipitation box plot side by side.
annualetgraph <- ggplot(annualgetetdata, aes("var", data)) + geom_boxplot(fill = "#009E73", color = "black", outlier.colour = "red", outlier.shape = 16, outlier.size = 2) + xlab("") + ylab("ET (mm per year)") + scale_x_discrete(breaks = NULL) + ggtitle("ET for HUC 101900030304") annualprcpgraph <- ggplot(annualgetprcpdata, aes("var", data)) + geom_boxplot(fill = "#56B4E9", color = "black", outlier.colour = "red", outlier.shape = 19, outlier.size = 2) + xlab("") + ylab("Precipitation (mm per year)") + scale_x_discrete(breaks = NULL) + ggtitle("Precipitation for HUC 101900030304") grid.arrange(annualetgraph, annualprcpgraph, ncol = 2)
Are the median lines centered between the lowest non-outlier to the highest non-outlier in either of the plots? This may be difficult to see. Whenever there is not a perfect balance of data above and below the median line, it is said that the data are skewed. Let's look at a normal distribution curve as shown below.
x <- seq(-4, 4, length=100) hx <- dnorm(x) plot(x, hx, type="l", lty=2, xlab="x value", ylab="Density", main="Normal Distribution")
ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 3, colour = "black", fill = "forestgreen") + ggtitle("Annual ET Histogram") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
library(ggplot2) library(gridExtra) # Plot of annualethist1 with binwidth = 0.5 annualethist1 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 0.5, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 0.5") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Plot of annualethist2 with binwidth = 1 annualethist2 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 1, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 1") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Plot of annualethist3 with binwidth = 3 annualethist3 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 3, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 3") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Plot of annualethist4 with binwidth = 6 annualethist4 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 6, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 6") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Plot of annualethist5 with binwidth = 10 annualethist5 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 10, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 10") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Plot of annualethist6 with binwidth = 15 annualethist6 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 15, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 15") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) grid.arrange(annualethist1, annualethist2, annualethist3, annualethist4, annualethist5, annualethist6, ncol = 2)
As you can see, changing the band width can drastically change the appearance of the plot. It can become very subjective as to which band width should be used. Therefore, we need a different solution. Below is a plot of a density curve over a histogram.
ggplot(annualgetetdata, aes(x = data)) + geom_histogram(aes(y = ..density..), binwidth = 3, colour = "black", fill = "white") + geom_density(alpha=.5, fill="#009E73") + geom_vline(aes(xintercept = 310.00), color="red", linetype = "dashed", size = 2) + ggtitle("Density vs Histogram Plot") + geom_vline(aes(xintercept = 320.20), color="orange", linetype = "solid", size = 2) + geom_vline(aes(xintercept = 330.46), color="purple", linetype = "solid", size = 2) + annotate("text", x=312.3, y=-.003, label = "50%") + annotate("text", x=322.5, y=-.003, label = "90%") + annotate("text", x=332.8, y=-.003, label = "99%") + xlab("mm ET") + ylab("Density")
ggplot(annualgetprcpdata, aes(x = data)) + geom_histogram(aes(y = ..density..), binwidth = 12, colour = "black", fill = "white") + geom_density(alpha=.5, fill="#009E73") + geom_vline(aes(xintercept = 378), color="red", linetype = "dashed", size = 2) + ggtitle("Density vs Histogram Plot") + geom_vline(aes(xintercept = 466.80), color="orange", linetype = "solid", size = 2) + geom_vline(aes(xintercept = 499.96), color="purple", linetype = "solid", size = 2) + annotate("text", x=390.5, y=-.0005, label = "50%") + annotate("text", x=479.5, y=-.0005, label = "90%") + annotate("text", x=512, y=-.0005, label = "99%") + xlab("mm ET") + ylab("Density")
The 50th, 90th, and 99th percentiles for this dataset are 378.00, 466.80, and 499.96 mm of annual precipitation, respectively. This means that in a given year for this particular watershed, there is a 10% chance that there will be more than 378 mm of rain. There is a 1% chance of more than 499.96 mm of rain falling. It is important to note that the density curve for this plot was produced using an averaging of the data. This method works very well for large datasets. For small datasets, there are alternative methods which will be discussed in Lab 5.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.