In akrmenec/efficient-R: Becoming an Efficient R Programmer

Exercise {-}

Create the following data frame

df_base = data.frame(colA = "A")

Try and guess the output of the following commands

print(df_base)
df_base$colA
df_base$col
df_base$colB

Now create a tibble data frame and repeat the above commands.

library("tibble")
df_tibble = tibble::tibble(colA = "A")

The tibble does not guess the column names:

print(df_tibble)
## # A tibble: 1 × 1
##    colA
##   <chr>
## 1     A
df_tibble$colA
## [1] "A"
df_tibble$col
## Warning: Unknown column 'col'
df_tibble$colB
## Warning: Unknown column 'colB'

Exercises {-}

What are the three criteria of tidy data?

Each variable forms a column

Each observation forms a row

Each unit of analysis (e.g. person, household) forms a single table

These answers can be found in the tidyr vignette:

r vignette(package = "tidyr") vignette("tidy-data")
Load and look at subsets of these datasets. The first is the pew datasets we've been using already. The second reports the points that define, roughly, the geographical boundaries of different London boroughs. What is 'untidy' about each?

data(pew, package = "efficient")
head(pew, 10)
data(lnd_geo_df, package = "efficient")
head(lnd_geo_df, 10)

##                    name_date population        x        y
## 1               Bromley-2001     295535 544361.6 172379.2
## 2               Bromley-2001     295535 549546.0 169911.0
## 3               Bromley-2001     295535 539596.3 160796.3
## 4               Bromley-2001     295535 533693.2 170730.1
## 5               Bromley-2001     295535 533717.7 170813.8
## 6               Bromley-2001     295535 534003.7 171442.1
## 7               Bromley-2001     295535 541104.7 173355.7
## 8               Bromley-2001     295535 544361.6 172379.2
## 9  Richmond upon Thames-2001     172330 523604.7 176320.7
## 10 Richmond upon Thames-2001     172330 521454.5 172361.5

In the pew dataset, not every column is a variable: the single variable of income band is represented by many columns.

Convert each of the above datasets into tidy form.

As illustrated previously, gather() converts wide data into long, tidy forms:

library("tidyr")
gather(pew, Income, count, -religion) # results not shot shown

In the lnd_geo_df dataset, not every column is a distinct variable: name_date contains two variables (the name of the zone and the date). This can be fixed as follows:

tidy_geo = separate(lnd_geo_df, name_date,
                  into = c("name", "date"), sep = -5)
unique(tidy_geo[1:2])[1:3,]

##                     name date
## 1               Bromley- 2001
## 9  Richmond upon Thames- 2001
## 15           Hillingdon- 2001

Consider the following string of phone numbers and fruits from the stringr package:

strings = c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
             "387 287 6718", "apple", "233.398.9187  ", "482 952 3315", "239 923 8115",
             "842 566 4692", "Work: 579-499-7527", "$1000", "Home: 543.355.3679")

Write functions in stringr and base R that return:

A logical vector reporting whether or not each string contains a number.
Complete words only, without extraneous non-letter characters.

r library(stringr) str_detect(string = strings, pattern = "[0-9]") str_extract(strings, pattern = "[A-z]+")
Try running each of the chaining examples above line-by-line, so the first two entries for the first example would look like this:

library("dplyr")
data("wb_ineq", package="efficient")

followed by:

wb_ineq %>% 
  select(Year, gini)

Explain in your own words what changes each time.

wb_ineq %>% 
  select(Year, gini) %>% # select only named columns
  mutate(decade = floor(Year / 10) * 10) %>% # create decade column
  group_by(decade) %>% # group the dataset
  summarise(mean(gini, na.rm = TRUE)) # find the mean gini index per year

## # A tibble: 6 × 2
##   decade `mean(gini, na.rm = TRUE)`
##    <dbl>                      <dbl>
## 1   1970                   40.05333
## 2   1980                   37.84583
## 3   1990                   42.02168
## 4   2000                   40.54188
## 5   2010                   39.28561
## 6     NA                        NaN

Use chained dplyr functions to answer the following question: In which year did countries without and 'a' in their name have the lowest level of inequality?

1981, as illustrated below:

wb_ineq %>% 
  filter(!grepl("a", Country)) %>%
  filter(!is.na(gini)) %>% 
  group_by(Year) %>% 
  summarise(mean_gini = min(mean(gini), na.rm = TRUE)) %>% 
  top_n(-1, mean_gini)

## # A tibble: 1 × 2
##    Year mean_gini
##   <int>     <dbl>
## 1  1981     23.29

Note: the question is open to interpretation. In the above answer, 'lowest level of inequality' is interpreted as 'mean gini index for all countries without a in their name per year'.

Exercises {-}

Refer back to the greenhouse gas emissions example, in which we found the top 3 countries in terms of emissions growth in the transport sector. a) Explain in words what is going on in each line. b) Try to find the top 3 countries in terms of emissions in 2012 - how is the list different?

Russia is now present on the list - it has high absolute transport emissions, but relatively low levels of emissions growth.

library("dplyr")
data("ghg_ems", package = "efficient")
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  filter(Year == 2012) %>% 
  top_n(3, Transportation) %>%
  arrange(desc(Transportation))

## # A tibble: 3 × 7
##              Country  Year Electricity Manufacturing Transportation  Other
##                <chr> <dbl>       <dbl>         <dbl>          <dbl>  <dbl>
## 1      United States  2012     2369.60        495.41        1667.26 541.88
## 2              China  2012     4404.92       2546.06         702.91 551.97
## 3 Russian Federation  2012      995.08        293.48         235.22 135.25
## # ... with 1 more variables: Fugitive <dbl>

Explore dplyr's documentation, starting with the introductory vignette, accessed by entering vignette("introduction").
Test additional dplyr 'verbs' on the wb_ineq dataset. (More vignette names can be discovered by typing vignette(package = "dplyr").)

Exercise {-}

Follow the worked example below to create and query a database on land prices in the UK using dplyr as a front end to an SQLite database. The first stage is to read-in the data:

# See help("land_df", package="efficient") for details
data(land_df, package="efficient")

The next stage is to create an SQLite database to hold the data:

# install.packages("RSQLite") # Requires RSQLite package
my_db = src_sqlite("land.sqlite3", create = TRUE)
land_sqlite = copy_to(my_db, land_df, indexes = list("postcode", "price"))

What class is the new object land_sqlite?

It's a table with additional attributes to show it's also an SQLite database.

class(land_sqlite)

Why did we use the indexes argument?

For fast sorting on these variables.

From the above code we can see that we have created a tbl. This can be accessed using dplyr in the same way as any data frame can. Now we can query the data. You can use SQL code to query the database directly or use standard dplyr verbs on the table.

# Method 1: using sql
tbl(my_db, sql('SELECT "price", "postcode", "old/new"  FROM land_df'))

How would you erform the same query using select()? Try it to see if you get the same result (hint: use backticks for the old/new variable name).

# Method 2: using dplyr
select(land_sqlite, price, postcode, `old/new`)

akrmenec/efficient-R documentation built on May 28, 2019, 4:53 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

akrmenec/efficient-R
Becoming an Efficient R Programmer

In akrmenec/efficient-R: Becoming an Efficient R Programmer

Exercise {-}

Exercises {-}

Exercises {-}

Exercise {-}

R Package Documentation

Browse R Packages

We want your feedback!

akrmenec/efficient-R Becoming an Efficient R Programmer

In akrmenec/efficient-R: Becoming an Efficient R Programmer

Exercise {-}

Exercises {-}

Exercises {-}

Exercise {-}

R Package Documentation

Browse R Packages

We want your feedback!

akrmenec/efficient-R
Becoming an Efficient R Programmer