# library(rooc) # source("../R/Doc.R") # source("../R/Globals.R") pulse <- read_pulse()
If the keys are not unique in either of the tibbles then the result will be ambiguous. Take for example df1
defined above, we now add another observation with name Yara
and height 168
:
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara", "Yara"), height = c(160, 172, 182, 157, 162, 168)) df1
This will make the key, variable name
, non-unique. Notice the result of a full join:
full_join(df1, df2, "name")
As you can see the join found two matches for Yara in df1
, so both are taken and the age for Yara in df2
is replicated.
r info_block("Use only unique key in joins to avoid ambiguous results.")
What if we update df2 also with name Yara
with age 19
and apply a join?
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara", "Yara"), age = c(20,24,17,23,17,19)) df2
full_join(df1,df2,"name")
As you can see we have now 4 ambiguous observations for Yara
Take df1
as defined above and df3
:
# height (cm) df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"), height = c(160, 172, 182, 157, 162)) # height (feet) df3 <- df1 %>% mutate(height=height/30.48) df3
both of these tibbles have the variable height, after joining by name
:
left_join(df1,df3,by="name")
we see that left_join
distinguishes height
from df1
and the height
from df3
with suffixes .x
and .y
respectively.
There are several more join functions not covered here such as right_join (opposite of left_join), semi_join and anti_join (see dplyr cheat sheet).
The base function duplicated
can be used to find duplicates in a vector or a tibble.
v <- c("Isa","Jaylinn","Mila","Bonnie", "Yara", "Mila", "Isa", "Mila") v
Here Isa
and Mila
occur two and three times respectively. The function duplicated
returns a logical with TRUE at corresponding positions whenever it matches a value that it had already seen from position 1 to n, n being the length of the vector in this case.
duplicated(v)
r info_block("duplicated(...) function does not mark the first occurence of the duplicated values with TRUE.")
On tibbles the function duplicated
can handle multiple variables(columns):
# the two observations on Yara are identical df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara", "Yara"), height = c(160, 172, 182, 157, 162, 162)) df1 duplicated(df1)
We can use this logical vector to filter observations from df1
:
df1 %>% filter(duplicated(df1))
Note that we only get 1 row as the result and not 2 because of the behaviour of duplicated function which does not mark the first match.
We can now test whether certain combinations of variables can uniquely identify each observation in the dataset. Let us take the variables {name,age} in the pulse dataset:
key1 <- pulse %>% select(name,age) key1
If the duplicated
function returns FALSE for all positions in key1
tibble then it means it did not find any duplicates.
# sum over a logical vector counts the number of TRUE values sum(duplicated(key1))
the sum here is non-zero therefore there are duplicated values, i.e. the key set key1
is not unique.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.