---
title: "CSO Package for R"
author: "Brendan O'Dowd"
date: "05 September, 2019"
output:
  html_document:
    keep_md: true
---
The main tool within the CSO package is a function `get_cso()`, which allows you to import statistical data directly from the CSO Statbank into the R environment. This document gives a couple of examples using `get_cso()`, and shows how some simple plots can be made.
The CSO package can be installed from GitHub as follows. Note that the function `install_github()` is part of the devtools package, so check that you have that package installed if you encounter any problems.
devtools::install_github("brendanjodowd/CSO" )
library(CSO)
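If the installation step fails, one quick thing to rule out is a missing devtools package. The following is a minimal sketch of such a check; it only reports the status and does not install anything, and the variable name `has_devtools` is just an illustrative choice.

```r
# requireNamespace() returns TRUE or FALSE without attaching the package.
has_devtools <- requireNamespace("devtools", quietly = TRUE)

if (has_devtools) {
  message("devtools is available, so devtools::install_github() can be used.")
} else {
  message("devtools is not installed; run install.packages(\"devtools\") first.")
}
```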
The following vignettes require a couple of additional packages: dplyr (which provides the pipe operator `%>%`), ggplot2 and stringr.
Topics introduced
- `get_cso()`
- `unique()`
- `filter()`
- `str_detect()`
- `select()`
- `ggplot()`, along with `geom_point()` and `geom_line()`
- `labs()`, `xlim()`, `ylim()` and `theme_light()`
Each Statbank table has a five character code, and this is used as the argument for `get_cso()`. To begin with, we will import the monthly unemployment data, which has the code "MUM01" on Statbank.
unemp <- get_cso("MUM01")
The time variables found in CSO Statbank tables include Year, Quarter and Month. Month comes in as a date, while Year arrives as a simple numeric variable. Quarter is of the form (e.g.) 1996Q1.
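Since Quarter arrives as text rather than as a date, it can be handy to split it into its parts before plotting. The sketch below uses only base R and a made-up vector of labels in the `1996Q1` form; the variable names are illustrative, and representing a quarter by the first day of its first month is just one possible convention.

```r
# Hypothetical sample of Quarter labels in the "1996Q1" form described above.
quarters <- c("1996Q1", "1996Q4", "2007Q3")

# The first four characters are the year, the sixth is the quarter number.
year    <- as.integer(substr(quarters, 1, 4))
quarter <- as.integer(substr(quarters, 6, 6))

# Represent each quarter by the first day of its first month.
quarter_start <- as.Date(sprintf("%d-%02d-01", year, (quarter - 1L) * 3L + 1L))
quarter_start
```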
Tip: Using `unique()` to view distinct elements in a column
You can have a quick look at the data frame `unemp` using `head(unemp)`. It contains categorical variables for 'Age.Group', 'Sex' and 'Statistic'. Then there is a variable called 'Month', and the numerical data itself is stored in a variable called 'value'.
If I'm unsure as to what categories are present in each variable, I usually use the function `unique()`, e.g.:
unique(unemp$Sex)
unique(unemp$Age.Group)
unique(unemp$Statistic)
## [1] "Both sexes" "Male" "Female"
## [1] "15 - 24 years" "15 - 74 years" "25 - 74 years"
## [1] "Seasonally Adjusted Monthly Unemployment (Thousand)"
## [2] "Seasonally Adjusted Monthly Unemployment Rate (%)"
Let's use `filter()` to examine the unemployment rate for both sexes, among 15 to 74 year-olds. To restrict the variable Statistic to just 'Seasonally Adjusted Monthly Unemployment Rate (%)', we could use `filter(Statistic=="Seasonally Adjusted Monthly Unemployment Rate (%)")`, but this is a bit cumbersome. Instead, try using `str_detect()` with the phrase "Rate", which appears in just one of the two categories of Statistic.
unemp1 <- unemp %>%
  filter(Age.Group == "15 - 74 years") %>%
  filter(Sex=="Both sexes") %>%
  filter(str_detect(Statistic , "Rate"))
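To see that the shorter `str_detect()` filter picks out the same rows as the full equality test, here is a small self-contained sketch. The data frame `toy` and its values are made up for illustration and are not real CSO figures.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the Statistic column; not real CSO data.
toy <- tibble(
  Statistic = c("Seasonally Adjusted Monthly Unemployment (Thousand)",
                "Seasonally Adjusted Monthly Unemployment Rate (%)"),
  value = c(120.5, 5.2)
)

# Exact match on the full category name.
exact <- toy %>%
  filter(Statistic == "Seasonally Adjusted Monthly Unemployment Rate (%)")

# Partial match on a phrase that occurs in only one category.
partial <- toy %>%
  filter(str_detect(Statistic, "Rate"))

identical(exact, partial)
```

Note that `str_detect()` treats its pattern as a regular expression by default, so a phrase containing characters such as `(` or `.` may need to be wrapped in `fixed()` to be matched literally.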
We'll use ggplot for plotting as this has a wide array of formatting options. The first argument is the name of the dataset itself. Then the x and y variables are indicated within the `aes()` function. We can specify a scatter plot using `+ geom_point()` as below, or a line plot using `+ geom_line()`.
ggplot(unemp1, aes(Month, value)) + geom_point()
Suppose we wanted to compare Male and Female unemployment over this period. We can make a very slight adjustment to our previous code, adding an exclamation mark before `Sex=="Both sexes"` to exclude this category, leaving the categories for Male and Female.
unemp2 <- unemp %>%
  filter(Age.Group == "15 - 74 years") %>%
  filter(!Sex=="Both sexes") %>%
  filter(str_detect(Statistic , "Rate"))
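The leading exclamation mark works because `!` has lower precedence than `==` in R, so the whole comparison is negated. The base R sketch below, using a made-up vector `sex`, shows that this is equivalent to the not-equal operator `!=`.

```r
sex <- c("Both sexes", "Male", "Female")

# `!` binds less tightly than `==`, so this negates the whole comparison.
keep_a <- !sex == "Both sexes"

# The same condition written with the not-equal operator.
keep_b <- sex != "Both sexes"

sex[keep_a]
identical(keep_a, keep_b)
```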
Now we can use `colour=Sex` in the `aes()` function to give the Male and Female categories different colours:
ggplot(unemp2, aes(Month, value, colour=Sex)) + geom_line()
Try using `+ facet_wrap(vars(Sex))` to produce an array of plots. One nice thing with `facet_wrap()` is that, by default, all panels share the same axis limits.
There are a couple of ways to set limits on the dates in the plot. One way is to filter the data itself using another `filter()` statement, like `filter(Month >= as.Date("2007-01-01"))`, which will provide output only since 2007. Note the use of `as.Date()`, and that the date is expressed in the form YYYY-MM-DD, which is the default for the `as.Date()` function. Another way is to use `+ xlim()` in the ggplot statement, as shown in the following example. I'm also specifying y-limits of 0 to 17 here using `+ ylim(0,17)`.
ggplot(unemp2, aes(Month, value, colour=Sex)) + geom_line() +
  xlim(as.Date("2007-01-01"),as.Date("2010-01-01")) +
  ylim(0,17)
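For comparison, the filtering approach can reproduce the same date window before any plotting happens. This is a minimal sketch on a made-up monthly series (`toy` and `windowed` are illustrative names), with both bounds given to a single `filter()` call.

```r
library(dplyr)

# Toy monthly series; the values are made up for illustration.
toy <- tibble(
  Month = seq(as.Date("2006-01-01"), as.Date("2011-12-01"), by = "month"),
  value = seq_along(Month)
)

# Keep January 2007 to January 2010 inclusive, mirroring the xlim() range.
windowed <- toy %>%
  filter(Month >= as.Date("2007-01-01"), Month <= as.Date("2010-01-01"))

range(windowed$Month)
nrow(windowed)
```

One practical difference is that `xlim()` leaves the data frame unchanged and ggplot2 drops the out-of-range rows with a warning, whereas filtering removes them up front.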
Let's finish up by using `labs()` to add labels, and adding `+ theme_light()` to give the graph a different look. You can look at some of the other themes that are available in the ggplot2 documentation. I'm including both `geom_line()` and `geom_point()`, with the points assigned a size of 2.
ggplot(unemp2, aes(Month, value, colour=Sex)) + geom_line() + geom_point(size=2) +
  xlim(as.Date("2007-01-01"),as.Date("2010-01-01")) +
  ylim(0,17) +
  labs(x = "", y="Percent", title="Monthly Unemployment") +
  theme_light()
Topics introduced
- `bind_rows()`
- `left_join()`
- `mutate()`
We're going to generate two datasets, one on population and the other on overseas travel by Irish residents. These will be used to demonstrate appending and joining datasets. I'm using `glimpse()` here, which is a function in dplyr, to have a quick look at the data.
travel <- get_cso("TMA08") %>%
  filter(Statistic=="Overseas Trips by Irish Residents (Thousand)") %>%
  filter(Reason.for.Journey=="All reasons for journey")
population <- get_cso("PEA01") %>%
  filter(Age.Group=="All ages" ) %>%
  filter(Sex == "Both sexes")
|Reason.for.Journey      | Year|Statistic                                    | value|
|:-----------------------|----:|:--------------------------------------------|-----:|
|All reasons for journey | 2009|Overseas Trips by Irish Residents (Thousand) |  7021|
|All reasons for journey | 2010|Overseas Trips by Irish Residents (Thousand) |  6660|
|All reasons for journey | 2011|Overseas Trips by Irish Residents (Thousand) |  6293|
|All reasons for journey | 2012|Overseas Trips by Irish Residents (Thousand) |  6326|
|All reasons for journey | 2013|Overseas Trips by Irish Residents (Thousand) |  6323|
|All reasons for journey | 2014|Overseas Trips by Irish Residents (Thousand) |  6514|

|Age.Group |Sex        | Year|Statistic                                          |  value|
|:---------|:----------|----:|:--------------------------------------------------|------:|
|All ages  |Both sexes | 1950|Population Estimates (Persons in April) (Thousand) | 2969.0|
|All ages  |Both sexes | 1951|Population Estimates (Persons in April) (Thousand) | 2960.6|
|All ages  |Both sexes | 1952|Population Estimates (Persons in April) (Thousand) | 2952.9|
|All ages  |Both sexes | 1953|Population Estimates (Persons in April) (Thousand) | 2949.0|
|All ages  |Both sexes | 1954|Population Estimates (Persons in April) (Thousand) | 2941.2|
|All ages  |Both sexes | 1955|Population Estimates (Persons in April) (Thousand) | 2920.9|
Let's append these using `bind_rows()`:
trips_and_pop <- bind_rows(travel , population)
|Reason.for.Journey      | Year|Statistic                                          |  value|Age.Group |Sex        |
|:-----------------------|----:|:--------------------------------------------------|------:|:---------|:----------|
|All reasons for journey | 2009|Overseas Trips by Irish Residents (Thousand)       | 7021.0|NA        |NA         |
|All reasons for journey | 2010|Overseas Trips by Irish Residents (Thousand)       | 6660.0|NA        |NA         |
|All reasons for journey | 2011|Overseas Trips by Irish Residents (Thousand)       | 6293.0|NA        |NA         |
|All reasons for journey | 2012|Overseas Trips by Irish Residents (Thousand)       | 6326.0|NA        |NA         |
|All reasons for journey | 2013|Overseas Trips by Irish Residents (Thousand)       | 6323.0|NA        |NA         |
|All reasons for journey | 2014|Overseas Trips by Irish Residents (Thousand)       | 6514.0|NA        |NA         |
|All reasons for journey | 2015|Overseas Trips by Irish Residents (Thousand)       | 6965.0|NA        |NA         |
|All reasons for journey | 2016|Overseas Trips by Irish Residents (Thousand)       | 7405.0|NA        |NA         |
|All reasons for journey | 2017|Overseas Trips by Irish Residents (Thousand)       | 7939.0|NA        |NA         |
|All reasons for journey | 2018|Overseas Trips by Irish Residents (Thousand)       | 8276.0|NA        |NA         |
|NA                      | 1950|Population Estimates (Persons in April) (Thousand) | 2969.0|All ages  |Both sexes |
|NA                      | 1951|Population Estimates (Persons in April) (Thousand) | 2960.6|All ages  |Both sexes |
|NA                      | 1952|Population Estimates (Persons in April) (Thousand) | 2952.9|All ages  |Both sexes |
|NA                      | 1953|Population Estimates (Persons in April) (Thousand) | 2949.0|All ages  |Both sexes |
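Appending stacks the two tables on top of each other, which is why the unmatched columns fill with NA. Joining instead places matching rows side by side. The sketch below is not the vignette's own join; it assumes two made-up tables (`trips` and `pop`, with invented numbers) that share only a `Year` column, and shows how `left_join()` and `mutate()` could combine them.

```r
library(dplyr)

# Toy tables with made-up numbers; only Year is shared between them.
trips <- tibble(Year = c(2016, 2017, 2018),
                trips = c(7000, 7500, 8000))
pop   <- tibble(Year = c(2015, 2016, 2017, 2018),
                population = c(4600, 4700, 4800, 4900))

# left_join() keeps every row of `trips` and adds the matching columns from `pop`.
joined <- left_join(trips, pop, by = "Year") %>%
  mutate(trips_per_person = trips / population)

joined
```

Because the join is keyed on `Year`, the 2015 row of `pop` has no partner in `trips` and is simply left out of the result.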
Topics introduced
- `mutate()`
- `geom_col()`, and flipping this using `+ coord_flip()`
- `geom_area()`
- `reorder(category, value)`
Suppose I want to examine the numbers of burglaries in West Dublin. I start by importing the file with annual crime statistics, which has the code 'CJA07'. This includes the variables 'Garda.Station', 'Type.of.Offence', 'Year', 'Statistic' and 'value'. I can find just those Garda stations in the western part of the Dublin Metropolitan Region (D.M.R.) by filtering matches to the string 'D.M.R. Western'. Then I find matches to 'Burglary' in 'Type.of.Offence', and restrict to only the years 2012, 2014 and 2016. Next, I want to `mutate` the names of the Garda stations: they all include the words ' Division, D.M.R. Western' after the town name, which I feel is redundant, so I keep just the part of 'Garda.Station' before the first comma using `word()` with `sep=","`. `word()` is a function from the stringr package. Finally, I remove two redundant variables (Statistic and Type.of.Offence).
dublin_crime <- get_cso("CJA07") %>%
  filter(str_detect(Garda.Station, "D.M.R. Western")) %>%
  filter(str_detect(Type.of.Offence, "Burglary")) %>%
  filter(Year %in% c(2012, 2014, 2016)) %>%
  mutate(Garda.Station = word(Garda.Station, 1, sep=",")) %>%
  select(-Statistic, -Type.of.Offence)
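The effect of the `sep` argument in `word()` is easiest to see on a single string. The label below is a hypothetical example in a "name, region" layout, and the real Statbank labels may be worded differently.

```r
library(stringr)

# Hypothetical station label; the real labels may differ.
station <- "Blanchardstown, D.M.R. Western Division"

# With sep = ",", the first "word" is everything before the first comma.
first_by_comma <- word(station, 1, sep = ",")

# With the default separator (a space), the first word keeps the trailing comma.
first_by_space <- word(station, 1)

first_by_comma
first_by_space
```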
Now let's plot the number of burglaries for each of these stations by year.
ggplot(dublin_crime , aes(Year, value, colour=Garda.Station)) + geom_line()
To do a column chart or bar chart, I always use `geom_col()`. You can use `geom_bar()`, but that's designed for counting instances of each entry, and needs `stat = "identity"` to plot a particular value.
ggplot(dublin_crime , aes(Year, value, fill=Garda.Station)) + geom_col()
Notice that `fill =` is used instead of `colour =`. In ggplot2, `colour` refers to lines, points and outlines, whereas `fill` refers to the interior colour of shapes. You can make grouped rather than stacked columns by using `geom_col(position = "dodge")`, and 100% stacked columns using `geom_col(position = "fill")`. You can also make a stacked area plot by using `geom_area()` instead of `geom_col()`.
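For readers who meet `geom_bar()` in other people's code, the sketch below builds the same stacked chart both ways on a made-up data frame (`toy`, with invented counts) and compares the bar heights that ggplot2 computes for each layer using `layer_data()`.

```r
library(ggplot2)

# Toy data with made-up counts.
toy <- data.frame(
  Year    = c(2012, 2012, 2014, 2014),
  Station = c("A", "B", "A", "B"),
  value   = c(10, 20, 15, 5)
)

# geom_col() plots the values as given.
p_col <- ggplot(toy, aes(Year, value, fill = Station)) + geom_col()

# geom_bar() counts rows by default, so stat = "identity" is needed to plot values.
p_bar <- ggplot(toy, aes(Year, value, fill = Station)) + geom_bar(stat = "identity")

# Compare the stacked bar tops computed for each plot.
all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)
```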
Try swapping `Year` with `Garda.Station` in your column plot code so that the primary grouping along the axis is by Garda Station. You will notice that the fill colour is a gradient and that the dodge option is no longer possible. This is because `Year` is a numerical variable. You can use `mutate` to convert it to a factor or a string, or simply wrap `Year` in the function `factor()` to use it as a factor. Here we also create a horizontal chart by adding `coord_flip()`.
ggplot(dublin_crime , aes(Garda.Station, value, fill=factor(Year))) +
  coord_flip() +
  geom_col(position="dodge")
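The other option mentioned above, converting `Year` with `mutate`, changes the column once so that every later plot treats it as discrete. A minimal sketch on made-up data (`toy` and `toy_discrete` are illustrative names):

```r
library(dplyr)

# Toy data with a numeric Year column; values are made up.
toy <- tibble(Year = c(2012, 2014, 2016), value = c(10, 15, 5))

# Convert Year to a factor so it is treated as a set of categories.
toy_discrete <- toy %>%
  mutate(Year = factor(Year))

class(toy$Year)
class(toy_discrete$Year)
levels(toy_discrete$Year)
```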
Very often, we will want to find the average (or median, or max...) for a group within our dataset. We might also want to compare individual values with averages for their group. This is done using `group_by` in combination with either `summarise` or `mutate`.
Let's take the first problem and calculate the average number of burglaries per year for each Garda Station:
dublin_averages <- dublin_crime %>%
  group_by(Garda.Station) %>%
  summarise(avg.burglaries = mean(value))
dublin_averages
## # A tibble: 8 x 2
## Garda.Station avg.burglaries
## <chr> <dbl>
## 1 Ballyfermot 205
## 2 Blanchardstown 672.
## 3 Cabra 121
## 4 Clondalkin 316.
## 5 Finglas 264
## 6 Lucan 226.
## 7 Rathcoole 119.
## 8 Ronanstown 210.
Notice that this contains one row for each Garda Station. Now let's look at the second option, where we want to calculate the mean for each Garda Station, but keep each of the original rows so that we can compare the means to the original data:
dublin_averages_2 <- dublin_crime %>%
  group_by(Garda.Station) %>%
  mutate(avg.burglaries = mean(value))
head(dublin_averages_2)
## # A tibble: 6 x 4
## # Groups: Garda.Station [2]
## Garda.Station Year value avg.burglaries
## <chr> <dbl> <int> <dbl>
## 1 Blanchardstown 2012 791 672.
## 2 Blanchardstown 2014 694 672.
## 3 Blanchardstown 2016 532 672.
## 4 Cabra 2012 136 121
## 5 Cabra 2014 149 121
## 6 Cabra 2016 78 121
We can now make a new variable called 'Status', equal to 'Above', 'Below' or 'Same' depending on the relationship between each value and the average for that Station. Here we will use the function `case_when`. Very often you see a different function: `if_else()` (or `ifelse()` in base R). I don't like `if_else()` if there are more than two options because you end up with very complicated nested functions, whereas with `case_when` the different options are neatly separated by commas.
dublin_averages_3 <- dublin_averages_2 %>%
  mutate(Status = case_when(
    value > avg.burglaries ~ "Above",
    value < avg.burglaries ~ "Below",
    value == avg.burglaries ~ "Same"
  )
  )
head(dublin_averages_3)
## # A tibble: 6 x 5
## # Groups: Garda.Station [2]
## Garda.Station Year value avg.burglaries Status
## <chr> <dbl> <int> <dbl> <chr>
## 1 Blanchardstown 2012 791 672. Above
## 2 Blanchardstown 2014 694 672. Above
## 3 Blanchardstown 2016 532 672. Below
## 4 Cabra 2012 136 121 Above
## 5 Cabra 2014 149 121 Above
## 6 Cabra 2016 78 121 Below
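To make the comparison with nested `if_else()` concrete, the sketch below computes the same three-way label both ways on a made-up data frame (`toy`, with invented values and a column `avg` standing in for the group average).

```r
library(dplyr)

# Toy values with a fixed average of 10; numbers are made up.
toy <- tibble(value = c(12, 8, 10), avg = c(10, 10, 10))

compared <- toy %>%
  mutate(
    # case_when(): one condition per line, separated by commas.
    status_case = case_when(
      value > avg  ~ "Above",
      value < avg  ~ "Below",
      value == avg ~ "Same"
    ),
    # The equivalent nested if_else() call.
    status_nested = if_else(value > avg, "Above",
                            if_else(value < avg, "Below", "Same"))
  )

compared
```

Both columns hold the same labels here; the difference is only in how readable the code stays as more options are added.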