library("dplyr") library("tidyr") library("ggplot2") data(okcupid, package = "jrTidyverse")
separate()
The original state of the okcupid data has numerous messy columns. Let's tidy some of them up. First, the location variable currently stores both the area and the state. We can separate it into two further variables using separate()
okcupid = okcupid %>% separate(location, c("area", "state"), sep = ", ")
Notice the warning, R is just telling us that in one of the rows there was two commas and so there were three pieces of information. In this case it was the inclusion of country information on top of the area and state. Next up, ethnicity. In some cases people have listed 3 ethnicities per person. We are only interested in the first one (obviously for actual analysis this is not recommended as it a gross under representation of todays multi ethnic society). Again we can use separate to rid of everything after first ethnicity
okcupid = okcupid %>% separate(ethnicity, "ethnicity", sep = ", ")
separate()
to separate the speaks
variable into $3$ different columns then use count()
)okcupid = okcupid %>% separate(speaks, c("first_lan", "sec_lan", "third_lan"), sep = ", ") okcupid %>% count(first_lan) # 168
separate()
operation on the column sec_lan
)okcupid = okcupid %>% separate(sec_lan, "second_lan", sep = " ")
okcupid %>% count(second_lan) %>% filter(second_lan == "c++")
separate()
to extract a persons religion, given that a persons religion is always stated as the first word in the column "religion". okcupid %>% separate(religion, c("religion"), sep = " ") %>% count(religion) %>% ggplot(aes(x = religion, y = n)) + geom_col() + coord_flip()
okcupid %>% separate(religion, c("religion"), sep = " ") %>% count(religion) %>% ggplot(aes(x = religion, y = n)) + geom_col() + coord_flip()
pivot_wider()
Long format data is great for us as R programmers as it is a convenient format for lots of things that we wish to do with our data. However it is not always the most useful way to share a table with others. An example of this might be as follows.
The code
(df = okcupid %>% group_by(sex) %>% count(orientation) )
creates a data structure that works well for ggplot but is more difficult to read in a report.
pivot_wider()
to give a column for each orientation, filled with the counts. Does it look a bit nicer?df %>% pivot_wider(names_from = orientation, values_from = n)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.