knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(dpjr)
This page is a listing of the raw data files in the {dpjr} package. The files can be accessed using the appropriate "read" function, with the package function dpjr_data() providing the path. For example,
df_mpg <- read.csv(dpjr_data("mpg.csv"))
The full list of files is as follows:
dpjr::dpjr_data()
All of the data in this package is covered under an open license. For information about the specific license for each data set in this package, please refer to the vignette "Data licenses".
A pair of fixed-width files containing information about ten authors born in the United States of America:
"name"
"state_of_birth", the two-letter abbreviation for the US state in which the author was born
"unique_id", a randomly generated personal ID (similar to the national identification numbers used in various countries)
The variables are as follows:
Variable        Width  Start position  End position
--------        -----  --------------  ------------
name               20               1            20
state_of_birth     10              21            30
unique_id          12              31            42
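A layout like this can be read with base R's read.fwf(). The following is a minimal sketch, not the package's own code: the two records are hypothetical placeholders written to a temporary file, and only the widths and variable names come from the table above.

```r
# Hypothetical records padded to the widths in the table (20, 10, 12);
# these are NOT rows from the package's files.
records <- c(
  sprintf("%-20s%-10s%-12s", "Author One", "NY", "100000000001"),
  sprintf("%-20s%-10s%-12s", "Author Two", "CA", "100000000002")
)
tmp <- tempfile(fileext = ".txt")
writeLines(records, tmp)

# Read every field as character so the 12-digit IDs are kept verbatim,
# and strip the padding that fills each field out to its fixed width
authors <- read.fwf(
  tmp,
  widths = c(20, 10, 12),
  col.names = c("name", "state_of_birth", "unique_id"),
  colClasses = "character",
  strip.white = TRUE
)
print(authors)
```

Reading the ID as character rather than numeric is a design choice: it avoids any reformatting of long digit strings and preserves leading zeros if they occur.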
Population and household estimates, England and Wales: Census 2021
Data is a tabulation of the population of England and Wales, drawn from the 2021 Census, and downloaded from the UK Office for National Statistics (ONS): Population and household estimates, England and Wales: Census 2021, 2022-06-28
Statistical bulletin: "Population and household estimates, England and Wales: Census 2021", 2022-06-28
"Local Authority Districts, Counties and Unitary Authorities (April 2021) Map in United Kingdom"
License:
The UK Open Government Licence for Public Sector Information
The files within the subdirectory cr25 are as follows:
list.files(dpjr::dpjr_data("cr25"))
For information about these files see the vignette "Data generation—CR25".
A csv version of the dataset in the {gapminder} package
International Tuition Fees at Public Post-Secondary Institutions by Economic Development Region
Sourced from the British Columbia Data Catalogue, this Excel file contains "Annual Academic Arts Program tuition fees for full-time international students at public post-secondary institutions by Economic Development Region (EDR) and by institution. Academic Years 2011/12 to 2021/22."
Original file name: "tui2_international_tuition_fees_at_public_post_secondary_institutions_by_economic_development_r.xlsx"
https://catalogue.data.gov.bc.ca/dataset/international-tuition-fees-at-public-post-secondary-institutions-by-economic-development-region
License:
Open Government Licence - British Columbia
Statistics Canada has made available an anonymized Public-Use Microdata File (PUMF) of the Joint Canada/United States Survey of Health, a telephone survey conducted in late 2002 and early 2003. There were 8,688 respondents to the survey, 3,505 Canadians and 5,183 Americans.
The PUMF is a fixed-width file named "JCUSH.txt". Each line is 552 columns in length.
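Before parsing a fixed-width file, it can be worth confirming that every line has the expected width. The sketch below illustrates the check on a hypothetical two-line stand-in file; only the width of 552 comes from the description above, and the real "JCUSH.txt" is not read here.

```r
expected_width <- 552

# Hypothetical stand-in file: two lines of filler, each 552 characters wide
tmp <- tempfile(fileext = ".txt")
writeLines(
  c(strrep("1", expected_width), strrep("2", expected_width)),
  tmp
)

# Count the characters on each line and compare with the expected width
line_widths <- nchar(readLines(tmp))
all_lines_ok <- all(line_widths == expected_width)
print(table(line_widths))
```

If the check fails, the table of line widths shows how many lines deviate, which usually points to truncated records or an encoding problem.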
The webpage for the survey, including the PUMF file, data dictionary, and methodological notes, is here: https://www150.statcan.gc.ca/n1/pub/82m0022x/2003001/4069119-eng.htm
License:
Statistics Canada Open License
Source: Statistics Canada, Joint Canada/United States Survey of Health 2002-03, 2004. Reproduced and distributed on an "as is" basis with the permission of Statistics Canada.
Total employment in Canada, monthly, unadjusted for seasonality, both sexes, age 15 and over, total all population centres and rural areas (thousands), 2011-01 to 2022-12
Table 14-10-0374-01, vector v1234977815
License:
Statistics Canada Open License
Source: Statistics Canada. Table 14-10-0374-01 Employment and unemployment rate, monthly, unadjusted for seasonality. DOI: https://doi.org/10.25318/1410037401-eng
A csv version of the famous mpg dataset, included in the {ggplot2} package
The penguins data frame from the {palmerpenguins} package, in different formats:
penguins.csv
# code to create file
readr::write_csv(palmerpenguins::penguins, here::here("penguins.csv"))
penguins.sas
# code to create file
haven::write_sas(palmerpenguins::penguins, here::here("penguins.sas"))
penguins.sav
# code to create file
haven::write_sav(palmerpenguins::penguins, here::here("penguins.sav"))
penguins.dta
# code to create file
haven::write_dta(palmerpenguins::penguins, here::here("penguins.dta"))
A fixed-width version of the {palmerpenguins} dataset
There are 8 different variables, described in the table below:
Variable           Width  Start position  End position
--------           -----  --------------  ------------
species                9               1             9
island                 9              10            18
bill_length_mm         4              19            22
bill_depth_mm          4              23            26
flipper_length_mm      3              27            29
body_mass_g            4              30            33
sex                    6              34            39
year                   4              40            43
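The start and end positions in the table follow directly from the widths. As a small illustration (the object names are arbitrary), they can be derived with cumsum():

```r
# Widths copied from the table above, in file order
widths <- c(
  species = 9, island = 9, bill_length_mm = 4, bill_depth_mm = 4,
  flipper_length_mm = 3, body_mass_g = 4, sex = 6, year = 4
)

# Each field ends at the running total of the widths ...
end_pos <- cumsum(widths)
# ... and starts one position after the previous field ends
start_pos <- end_pos - widths + 1

layout <- data.frame(
  variable = names(widths),
  width = unname(widths),
  start = unname(start_pos),
  end = unname(end_pos)
)
print(layout)
```

The final end position equals the total record width, which is a quick way to check a layout against the length of the lines in the file.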
The fixed-width file has been created to minimize white space. The first four and last two rows of the data look like this:
readLines(dpjr::dpjr_data("penguins_fwf.txt"), n = 4)
tail(readLines(dpjr::dpjr_data("penguins_fwf.txt")), 2)
Note that the first row contains data, not the variable names. This is common in fixed-width files.
# code to create fwf file
gdata::write.fwf(
  penguins_df,
  file = "penguins_fwf.txt",
  width = c(
    9, # species
    9, # island
    4, # bill_length_mm
    4, # bill_depth_mm
    3, # flipper_length_mm
    4, # body_mass_g
    6, # sex
    4  # year
  ),
  sep = "",
  colnames = FALSE
)
A fixed-width version of the {palmerpenguins} dataset, where the character strings associated with the values have been replaced by numeric values.
There are 8 different variables, described in the table below:
Variable           Width  Start position  End position
--------           -----  --------------  ------------
species                1               1             1
island                 1               2             2
bill_length_mm         4               3             6
bill_depth_mm          4               7            10
flipper_length_mm      3              11            13
body_mass_g            4              14            17
sex                    1              18            18
year                   4              19            22
Variable  Value  Character
--------  -----  ---------
species       1  Adelie
              2  Gentoo
              3  Chinstrap
island        1  Torgersen
              2  Biscoe
              3  Dream
sex           1  female
              2  male
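Once the coded file has been read, the numeric values can be converted back to labels with factor(). The sketch below uses a small hypothetical data frame in place of the real file; the levels and labels are taken from the value table above.

```r
# Hypothetical coded rows standing in for data read from the file
coded <- data.frame(
  species = c(1, 3, 2),
  island = c(1, 3, 2),
  sex = c(1, 2, NA)
)

# Map each numeric code to its label; NA stays NA
decoded <- transform(
  coded,
  species = factor(species, levels = 1:3,
                   labels = c("Adelie", "Gentoo", "Chinstrap")),
  island = factor(island, levels = 1:3,
                  labels = c("Torgersen", "Biscoe", "Dream")),
  sex = factor(sex, levels = 1:2, labels = c("female", "male"))
)
print(decoded)
```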
The fixed-width file is now about half the width of the fixed-width file that has the character values (22 characters per row, compared with 43). The first four and last two rows of the data look like this:
readLines(dpjr::dpjr_data("penguins_fwf_code.txt"), n = 4)
tail(readLines(dpjr::dpjr_data("penguins_fwf_code.txt")), 2)
Note that the first row contains data, not the variable names. This is common in fixed-width files.
An Excel version of the {palmerpenguins} dataset.
This file simulates the circumstance where the data were originally collected and stored in an SPSS file, and the data collectors have made the data available as an Excel file.
The file contains two sheets:
penguins_values contains the individual records, with the variables coded numerically. Note that for all variables, "NA" is used to represent missing values.
penguins_codebook contains the output SPSS creates when the "DISPLAY DICTIONARY" syntax (or the "File > Display Data File Information > Working File" GUI menu sequence) is used to produce the "dictionary" (or codebook). There is information about each of the variables, including whether that variable is nominal or scale.
Variable  Value  Character
--------  -----  ---------
species       1  Adelie
              2  Chinstrap
              3  Gentoo
island        1  Biscoe
              2  Dream
              3  Torgersen
sex           1  female
              2  male
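Because the coding in this file is defined by its codebook, one approach is to hold the codebook as a lookup table and decode values by matching against it. This is a sketch under assumptions: the codebook data frame is typed in by hand from the table above rather than parsed from the penguins_codebook sheet, and decode() is a hypothetical helper, not a {dpjr} function.

```r
# Codebook entered by hand from the value table above
codebook <- data.frame(
  variable = c(rep("species", 3), rep("island", 3), rep("sex", 2)),
  value = c(1, 2, 3, 1, 2, 3, 1, 2),
  label = c("Adelie", "Chinstrap", "Gentoo",
            "Biscoe", "Dream", "Torgersen",
            "female", "male"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the label for each code of one variable
decode <- function(x, var, codebook) {
  cb <- codebook[codebook$variable == var, ]
  cb$label[match(x, cb$value)]
}

decode(c(3, 1, NA), "species", codebook)
```

Keeping the mapping in a table rather than hard-coding it in each factor() call makes it easier to swap in a different codebook when the same variables are coded differently in another file.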
Note that the National Travel Survey (NTS) data referenced in Chapter 7 of The Data Preparation Journey is not included in the {dpjr} package. The microdata file for the 2020 reference year can be found here:
https://doi.org/10.25318/24250001
Additional information about the NTS can be found here: https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&Id=1535550&
License:
Data are available by CC-0 license in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data.
-30-