classdata: Real-Life Data Used in Course Work

#' Ames housing data 2017 to 2022
#'
#' A dataset containing details on residential sales in Ames between Jan 1 2017 and Aug 31 2022.
#' This data is publicly available through the Ames Assessor at 
#' \url{https://www.cityofames.org/government/departments-divisions-a-h/city-assessor/sales}. 
#' The variables are as follows (more details can be found on the Ames website):
#' @format A data frame with 6935 rows and 16 variables:
#' \describe{
#'   \item{Parcel ID}{character with ID. }
#'   \item{Address}{property address in Ames, IA. }
#'   \item{Style}{factor variable detailing the type of housing.}
#'   \item{Occupancy}{factor variable of type of housing.}
#'   \item{Sale Date}{date of sale.}
#'   \item{Sale Price}{sale price (in US dollars).}
#'   \item{Multi Sale}{logical value: was this sale part of a package?}
#'   \item{YearBuilt}{integer value: year in which the house was built.}
#'   \item{Acres}{acres of the lot.}
#'   \item{TotalLivingArea (sf)}{total living area in square feet.}
#'   \item{Bedrooms}{number of bedrooms.}
#'   \item{FinishedBsmtArea (sf)}{total area of the finished basement in square feet.}
#'   \item{LotArea(sf)}{total lot area in square feet.}
#'   \item{AC}{logical value: does the property have an AC?}
#'   \item{FirePlace}{logical value: does the property have a fireplace?}
#'   \item{Neighborhood}{factor variable - levels indicate neighborhood area in Ames.}
#' }
#' @keywords datasets
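#' @examples
#' # A minimal sketch, assuming ggplot2 is installed and that the column
#' # names match the list above (names containing spaces need backticks).
#' library(ggplot2)
#'
#' # sale price against total living area
#' ggplot(ames, aes(x = `TotalLivingArea (sf)`, y = `Sale Price`)) +
#'   geom_point(alpha = 0.3)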
"ames"


#' Numbers of crimes by state
#'
#' A dataset containing the number of property and violent crimes across the 
#' United States from about 1980 to 2020. 
#' The data was acquired through the API for the FBI's Crime Data Explorer 
#' at \url{https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/explorer/crime/crime-trend}. 
#' The variables are as follows (more detail on the FBI website):
#' @format A data frame with 19476 rows and 8 variables:
#' \describe{
#'   \item{state}{name of the state for which numbers are reported.}
#'   \item{state_id}{id for each state. }
#'   \item{state_abbr}{two letter state abbreviation. }
#'   \item{year}{year of the reporting.}
#'   \item{population}{state population.}
#'   \item{type}{type of crime.}
#'   \item{count}{number of reported crimes.}
#'   \item{violent_crime}{logical value: is this crime against a person?}
#' }
#' @keywords datasets
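#' @examples
#' # A minimal sketch, assuming dplyr and ggplot2 are installed.
#' library(dplyr)
#' library(ggplot2)
#'
#' # reported crimes per 100,000 residents in Iowa, by type of crime
#' fbi %>%
#'   filter(state_abbr == "IA") %>%
#'   ggplot(aes(x = year, y = count / population * 100000)) +
#'     geom_point() +
#'     facet_wrap(~type, scales = "free_y")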
"fbi"

#' Numbers of crimes by state
#'
#' A dataset containing the number of property and violent crimes across the 
#' United States as reported through the API for the FBI's Crime Data Explorer 
#' at \url{https://crime-data-explorer.fr.cloud.gov/pages/home}. 
#' The variables are as follows (more detail on the FBI website):
#' @format A data frame with 2164 rows and 16 variables:
#' \describe{
#'   \item{state}{name of the state for which numbers are reported.}
#'   \item{state_id}{id for each state. }
#'   \item{state_abbr}{two letter state abbreviation. }
#'   \item{year}{year of the reporting.}
#'   \item{population}{state population.}
#'   \item{violent_crime}{number of violent crimes. This number should be the sum of the next five variables.}
#'   \item{homicide}{number of reported murders.}
#'   \item{rape_legacy}{number of reported rapes before 2013. The definition of rape was updated in 2012 and reported afterwards as `rape_revised`.}
#'   \item{rape_revised}{number of reported rapes using the definition of 2012.}
#'   \item{robbery}{number of reported robberies.}
#'   \item{aggravated_assaults}{number of reported aggravated assaults.}
#'   \item{property_crime}{number of property crimes. This number should be the sum of the next four variables.}
#'   \item{burglary}{number of reported burglaries.}
#'   \item{larceny}{number of reported larceny thefts.}
#'   \item{motor_vehicle_theft}{number of reported motor vehicle thefts.}
#'   \item{arson}{number of reported incidents of arson.}
#' }
#' @keywords datasets
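#' @examples
#' # A minimal sketch (base R only) to check the documented totals: each
#' # summary would be all zeros if a total equals the sum of its components.
#' # Missing values in any component propagate as NA and are counted by summary().
#' with(fbiwide, summary(property_crime -
#'   (burglary + larceny + motor_vehicle_theft + arson)))
#' with(fbiwide, summary(violent_crime -
#'   (homicide + rape_legacy + rape_revised + robbery + aggravated_assaults)))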
"fbiwide"

#' Numbers of crimes by state
#'
#' A dataset containing the number of property and violent crimes across the United States from 1960 to 2017. 
#' The data from 1960 to 2014 was made available by the FBI in the Uniform Crime Reporting Statistics (UCR) at \url{https://www.ucrdatatool.gov/index.cfm}. 
#' For 2014 to 2019, the data is available as Excel tables at \url{https://ucr.fbi.gov/crime-in-the-u.s/}.
#' This dataset is now superseded by `fbi`.
#' The variables are as follows (more detail on the FBI website):
#' @format A data frame with 24088 rows and 7 variables:
#' \describe{
#'   \item{State}{name of the state for which numbers are reported.}
#'   \item{Abb}{two letter state abbreviation. }
#'   \item{Year}{year of the reporting.}
#'   \item{Population}{state population.}
#'   \item{Type}{type of crime.}
#'   \item{Count}{number of reported crimes.}
#'   \item{Violent.crime}{logical value: is this crime against a person?}
#' }
#' @keywords datasets
"fbi.v1"

#' Numbers of crimes by state
#'
#' A dataset containing the number of property and violent crimes across the United States from 1960 to 2019. 
#' The data was made available by the FBI in the Uniform Crime Reporting Statistics (UCR) at \url{https://www.ucrdatatool.gov/index.cfm}.
#' This dataset is now superseded by `fbiwide`.
#' The variables are as follows (more detail on the FBI website):
#' @format A data frame with 3011 rows and 12 variables:
#' \describe{
#'   \item{State}{name of the state for which numbers are reported.}
#'   \item{Abb}{two letter state abbreviation. }
#'   \item{Year}{year of the reporting.}
#'   \item{Population}{state population.}
#'   \item{Aggravated.assault}{number of reported aggravated assaults.}
#'   \item{Burglary}{number of reported burglaries.}
#'   \item{Larceny.theft}{number of reported larceny thefts.}
#'   \item{Legacy.rape}{number of reported rapes before 2013. The definition of rape was updated in 2012 and reported afterwards (see below).}
#'   \item{Motor.vehicle.theft}{number of reported motor vehicle thefts.}
#'   \item{Murder}{number of reported murders.}
#'   \item{Rape}{number of reported rapes using the definition of 2012.}
#'   \item{Robbery}{number of reported robberies.}
#' }
#' @keywords datasets
"fbiwide.v1"


#' Numbers of crimes by state and source
#'
#' A dataset containing the state-wide counts of offenses for a selected number of crimes since 1979
#' as reported through the API for the FBI's Crime Data Explorer 
#' at \url{https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/docApi}. 
#' Last updated: Sep 20 2023<br>
#' The variables are as follows (more detail on the FBI website):
#' @format A tibble with 26520 rows and 12 columns:
#' \describe{
#'   \item{state}{name of the state for which numbers are reported.}
#'   \item{state_id}{id for each state. }
#'   \item{state_abbr}{two letter state abbreviation. }
#'   \item{year}{year of the reporting.}
#'   \item{population}{state population.}
#'   \item{type}{type of crime.}
#'   \item{count}{number of reported offenses.}
#'   \item{total_agency_count}{total number of crime-solving agencies in the state.}
#'   \item{agency_submitting}{number of agencies that reported data.}
#'   \item{population_covered}{percent of the state's population covered by reporting agencies.}
#'   \item{source}{source of the estimate: SRS (Summary Reporting System) or 
#'                 NIBRS (National Incident-Based Reporting System).}
#' }
#' @keywords datasets
#' @examples
#' # example code
#' library(tidyverse)
#' 
#' # compliance with reporting varies drastically by state
#' fbi.v2 %>% 
#'   ggplot(aes(x = year, y = agency_submitting/total_agency_count*100)) + 
#'     geom_point(aes(colour = source)) + 
#'     facet_wrap(~state, scales="free_y")
#'
#' # population size is related to compliance
#' fbi.v2 %>% 
#'   filter(year==2021) %>%
#'   ggplot(aes(x = population, y = agency_submitting/total_agency_count*100)) + 
#'     geom_point() +
#'     geom_text(aes(label=state_abbr), 
#'               nudge_y = 3,
#'               data = fbi.v2 %>% 
#'                      filter(year==2021, 
#'                             agency_submitting/total_agency_count*100 < 50 | 
#'                             population > 2e+07) %>% unique())
#'     
#' # comparison of SRS and NIBRS counts in Iowa across all types of offenses
#' fbi.v2 %>% filter(state_abbr=="IA") %>%  
#'   ggplot(aes(x = year, y = count)) + 
#'     geom_point(aes(colour = source)) + 
#'     facet_wrap(~type, scales="free_y")
#'
#' # comparison of SRS and NIBRS counts in New York across all types of offenses
#' fbi.v2 %>% filter(state_abbr=="NY") %>%  
#'   ggplot(aes(x = year, y = count)) + 
#'     geom_point(aes(colour = source)) + 
#'     facet_wrap(~type, scales="free_y")
"fbi.v2"

#' Numbers of crimes by state and source - wide format
#'
#' A dataset containing the number of property and violent crimes across the 
#' United States as reported through the API for the FBI's Crime Data Explorer 
#' at \url{https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/docApi}. 
#' Last updated: Sep 20 2023<br>
#' The variables are as follows (more detail on the FBI website):
#' @format A tibble with 9139 rows and 17 columns
#' \describe{
#'   \item{state}{name of the state for which numbers are reported.}
#'   \item{state_id}{id for each state. }
#'   \item{state_abbr}{two letter state abbreviation. }
#'   \item{year}{year of the reporting.}
#'   \item{population}{state population.}
#'   \item{total_agency_count}{number of agencies in the state.}
#'   \item{agency_submitting}{number of agencies involved in reporting to the system.}
#'   \item{source}{source of the estimate: SRS or NIBRS.}
#'   \item{homicide}{number of reported murders.}
#'   \item{rape_legacy}{number of reported rapes before 2013. The definition of rape was updated in 2012 and reported afterwards as `rape_revised`.}
#'   \item{rape_revised}{number of reported rapes using the definition of 2012.}
#'   \item{robbery}{number of reported robberies.}
#'   \item{aggravated_assaults}{number of reported aggravated assaults.}
#'   \item{property_crime}{number of property crimes. This number should be the sum of the next four variables.}
#'   \item{burglary}{number of reported burglaries.}
#'   \item{larceny}{number of reported larceny thefts.}
#'   \item{motor_vehicle_theft}{number of reported motor vehicle thefts.}
#'   \item{arson}{number of reported incidents of arson.}
#' }
#' @keywords datasets
"fbiwide.v2"

#' Numbers of crimes by city
#'
#' A dataset containing the number of property and violent crimes across Iowa from 2006 to 2016. 
#' The data was made available by the FBI in the Uniform Crime Reporting Statistics (UCR) at \url{https://www.ucrdatatool.gov/index.cfm}.
#' The variables are as follows (more detail on the FBI website):
#'
#' @format A data frame with 1207 rows and 14 variables:
#' \describe{
#'   \item{City}{name of the city for which numbers are reported.}
#'   \item{Population}{city population.}
#'   \item{Violent.crime}{number of reported violent crimes.}
#'   \item{Murder}{number of reported murders.}
#'   \item{Rape}{number of reported rapes. The definition of rape was updated in 2012. Numbers from 2013 onward reflect the new definition.}
#'   \item{Robbery}{number of reported robberies.}
#'   \item{Aggravated.assault}{number of reported aggravated assaults.}
#'   \item{Property.crime}{total number of reported property crimes.}
#'   \item{Burglary}{number of reported burglaries.}
#'   \item{Larceny.theft}{number of reported larceny thefts.}
#'   \item{Motor.vehicle.theft}{number of reported motor vehicle thefts.}
#'   \item{Arson}{number of reported cases of arson.}
#'   \item{State}{name of the state for which numbers are reported.}
#'   \item{Year}{year of the reporting.}
#' }
#' @keywords datasets
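#' @examples
#' # A minimal sketch, assuming ggplot2 is installed.
#' library(ggplot2)
#'
#' # violent crimes per 1,000 population over time, one line per city
#' ggplot(cities, aes(x = Year, y = Violent.crime / Population * 1000,
#'                    group = City)) +
#'   geom_line(alpha = 0.3)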
"cities"


#' Data related to happiness from the general social survey.
#'
#' The data is a small sample of variables related to happiness from the
#' General Social Survey (GSS). The GSS is a yearly cross-sectional survey of
#' Americans, run since 1972. We combine data from more than 30 samples to yield over 70 thousand
#' observations. Of the more than 5,000 variables, we select some that are related to
#' happiness:
#'
#' \itemize{
#'  \item year. year of the response, 1972 to 2018.
#'  \item age. age in years: 18--89 (89 stands for respondents aged 89 and older).
#'  \item degree. highest education: lt high school, high school, junior
#'     college, bachelor, graduate.
#'  \item finrela. how is your financial status compared to others: far below, below average, average, above average, far above.
#'  \item happy. happiness: very happy, pretty happy, not too happy.
#'  \item health. health: excellent, good, fair, poor.
#'  \item marital. marital status:  married, never married, divorced,
#'    widowed, separated.
#'  \item sex. sex: female, male.
#'  \item polviews. from extremely conservative to extremely liberal.
#'  \item partyid. party identification: strong republican, not str republican, ind near rep, independent, ind near dem, not str democrat, strong democrat, other party.
#'  \item wtssall. probability weight. applicable to years up to 2018
#'  \item wtssnr. probability weight. applicable to years from 2004
#' }
#'
#' @keywords datasets
#' @name happy
#' @usage data(happy)
#' @format A data frame with 72390 rows and 12 variables
#' @examples 
#' library(dplyr)
#' library(ggplot2)
#' happy %>% 
#'   filter(!is.na(happy), !is.na(sex)) %>%
#'   ggplot(aes(x = factor(year), fill = happy)) + 
#'     geom_bar(position = "fill") +
#'     facet_grid(sex~.) +
#'     scale_fill_brewer(palette="Greens")
"happy"

#' Box office data
#'
#' The data contains weekly box office numbers as published on the website 
#' \url{https://www.the-numbers.com/weekly-box-office-chart}, scraped on 
#' Nov 3 2022.
#'
#' \itemize{
#'  \item Rank current rank of the movie according to gross box office
#'  \item Rank.Last.Week last week's ranking of box office gross
#'  \item Movie name of the movie
#'  \item Distributor name of the distributor
#'  \item Gross weekly box office gross in US dollars 
#'  \item Change percent change in gross from last week
#'  \item Thtrs. number of movie theaters the movie is being shown in
#'  \item Per.Thr. per theater gross
#'  \item Total.Gross cumulative box office gross in 100 million US dollars 
#'  \item Week number of weeks a movie has been shown
#'  \item Date date of the publication of box office numbers
#' }
#'
#' @keywords datasets
#' @name box
#' @usage data(box)
#' @format A data frame with 46497 rows and 11 variables
"box"


#' Movie budget data
#'
#' The data contains movie budgets and box office gross (as far as they are known)
#' for about 6000 movies (as of Nov 8 2022), as published 
#' on \url{https://www.the-numbers.com/movie/budgets/all/1}.
#'
#' \itemize{
#'  \item Release Date (date) 
#'  \item Movie name of the movie
#'  \item Production Budget (dbl) budget in US dollars 
#'  \item Domestic Gross (dbl) in US dollars 
#'  \item Worldwide Gross (dbl) in US dollars 
#' }
#'
#' @keywords datasets
#' @name budget
#' @usage data(budget)
#' @format A data frame with 6341 rows and 5 variables
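#' @examples
#' # A minimal sketch, assuming ggplot2 is installed and that the column
#' # names match the list above (names containing spaces need backticks).
#' library(ggplot2)
#'
#' # worldwide gross against production budget on log scales;
#' # movies with a zero or missing value are dropped first
#' known <- subset(budget, `Production Budget` > 0 & `Worldwide Gross` > 0)
#' ggplot(known, aes(x = `Production Budget`, y = `Worldwide Gross`)) +
#'   geom_point(alpha = 0.3) +
#'   scale_x_log10() +
#'   scale_y_log10()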
"budget"


#' Box office data from the Mojo website
#'
#' The data contains weekly box office numbers as published on the website 
#' \url{https://www.boxofficemojo.com/weekend/chart/}, scraped on 
#' Sep 10 2018.
#'
#' \itemize{
#'  \item TW rank this week
#'  \item LW rank last week
#'  \item Title name of the movie
#'  \item Studio name of the producing studio
#'  \item Weekend Gross weekend box office gross in US dollars 
#'  \item `% Change` percent change in weekend gross from last week
#'  \item Theater Count number of movie theaters the movie is being shown in
#'  \item Theater Change change in the number of theaters the movie was shown in
#'  \item Average average gross per theater 
#'  \item Total Gross cumulative box office gross in US dollars 
#'  \item Budget (in Million) estimated budget
#'  \item Week number of weeks that the movie has been in theaters
#'  \item Weekend character string of the weekend of the show date
#'  \item Year integer of the year of the show date (between 2013 and 2018)
#'  \item WeekNo integer, week number of the year (1 through 52 or 53) 
#' }
#'
#' @keywords datasets
#' @name mojo
#' @usage data(mojo)
#' @format A data frame with 31718 rows and 15 variables.
"mojo"

#' Passengers and crew on board the Titanic
#'
#' A dataset containing some demographics and survival of people on board the Titanic.
#' @format A data frame with 2201 rows and 4 variables:
#' \describe{
#'   \item{Class}{factor variable containing the class of a passenger (1st, 2nd, 3rd) or crew.}
#'   \item{Sex}{Male/Female.}
#'   \item{Age}{Child/Adult. This information is not very reliable, because it was inferred from boarding documents that did not state actual age in years.}
#'   \item{Survived}{Yes/No.}
#' }
#' @keywords datasets
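#' @examples
#' # A minimal sketch using base R only.
#' # counts of survivors and non-survivors by class
#' tab <- xtabs(~ Class + Survived, data = titanic)
#' tab
#'
#' # proportions within each class (each row sums to 1)
#' prop.table(tab, margin = 1)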
"titanic"

#' Earthquake data
#'
#' The USGS monitors and reports earthquakes and earthquake-like events in near real time at \url{https://www.usgs.gov/natural-hazards/earthquake-hazards}. 
#' For more information on the variables see \url{https://earthquake.usgs.gov/data/comcat/data-eventterms.php}.
#' 
#' \itemize{
#'  \item time date and time of the event in UTC
#'  \item latitude geographic latitude
#'  \item longitude geographic longitude
#'  \item depth approximate depth of the event
#'  \item mag magnitude of the event 
#'  \item magType method or algorithm used to calculate magnitude 
#'  \item nst total number of seismic stations used to determine earthquake location.
#'  \item gap largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake.
#'  \item dmin Horizontal distance from the epicenter to the nearest station (in degrees). 1 degree is approximately 111.2 kilometers. In general, the smaller this number, the more reliable is the calculated depth of the earthquake.
#'  \item rms root-mean-square (RMS) travel time residual (in seconds). In general, the smaller this number, the better the observed arrival times fit the calculated location.
#'  \item net ID of a data contributor.
#'  \item id unique identifier for the event
#'  \item updated time when the event was most recently updated
#'  \item place named geographic region near to the event
#'  \item type type of seismic event.
#'  \item horizontalError uncertainty of reported location of the event in kilometers.
#'  \item depthError uncertainty of reported depth of the event in kilometers.
#'  \item magError uncertainty of reported magnitude of the event.
#'  \item magNst total number of seismic stations used to calculate the magnitude for this earthquake.
#'  \item locationSource network that originally authored the reported location of this event.
#'  \item magSource network that originally authored the reported magnitude for this event.
#' }
#'
#' @keywords datasets
#' @name earthquakes
#' @usage data(earthquakes)
#' @format A data frame with 22 variables
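#' @examples
#' # A minimal sketch, assuming ggplot2 is installed.
#' library(ggplot2)
#'
#' # locations of events, with point size showing magnitude
#' ggplot(earthquakes, aes(x = longitude, y = latitude)) +
#'   geom_point(aes(size = mag), alpha = 0.3)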
"earthquakes"