View source: R/dplyr-distinct.R
| distinct.data_request | R Documentation |
Keep only unique/distinct rows from a data frame. This is similar to
unique.data.frame() but considerably faster. It is evaluated lazily.
## S3 method for class 'data_request'
distinct(.data, ..., .keep_all = FALSE)
| .data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
| ... | Variables to use when determining uniqueness. Unlike the |
| .keep_all | If |
This function has several potential uses. In its default mode, it simply shows the unique values for a supplied field:
galah_call() |>
distinct(basisOfRecord) |>
collect()
# A tibble: 9 × 1
basisOfRecord
<chr>
1 HUMAN_OBSERVATION
2 PRESERVED_SPECIMEN
3 OCCURRENCE
4 MACHINE_OBSERVATION
5 OBSERVATION
6 MATERIAL_SAMPLE
7 LIVING_SPECIMEN
8 FOSSIL_SPECIMEN
9 MATERIAL_CITATION
This is the same result as you would get using show_values():
search_all(fields, "basisOfRecord") |> show_values()
Using distinct() is somewhat more reliable, however, as it doesn't rely
on searching the tibble returned by show_all(fields). It is also more
efficient, particularly when caching is turned off. If the goal is to
retrieve the number of levels of a factor, use:
galah_call() |>
distinct(basisOfRecord) |>
count() |>
collect()
# A tibble: 1 × 1
count
<int>
1 9
When the variable passed to distinct() in the above example is
speciesID, this is identical to calling:
atlas_counts(type = "species")
You can also add group_by() to the pipe
to find the number of facets per level of a second variable:
galah_call() |>
identify("Perameles") |>
distinct(speciesID) |>
group_by(basisOfRecord) |>
count() |>
collect()
# A tibble: 8 × 2
basisOfRecord count
<chr> <int>
1 Human observation 7
2 Preserved specimen 9
3 Machine observation 2
4 Observation 3
5 Occurrence 3
6 Material Sample 4
7 Fossil specimen 1
8 Living specimen 1
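For comparison, a rough local analogue of the grouped count above can be sketched with dplyr on an in-memory data frame. This is only an illustration: the `occurrences` tibble and its values are invented, and the sketch assumes dplyr is installed. Note that plain dplyr needs both columns named in distinct(), because distinct(speciesID) on its own would drop basisOfRecord.

```r
library(dplyr)

# Hypothetical toy records; the column names mirror the fields used above,
# but the values are invented for illustration only.
occurrences <- tibble(
  speciesID     = c("sp1", "sp1", "sp2", "sp3", "sp3"),
  basisOfRecord = c("HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
                    "HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "PRESERVED_SPECIMEN")
)

# Keep one row per (basisOfRecord, speciesID) pair, then count the
# remaining rows within each basisOfRecord level.
species_per_basis <- occurrences |>
  distinct(basisOfRecord, speciesID) |>
  count(basisOfRecord, name = "count")

print(species_per_basis)
```

Here HUMAN_OBSERVATION has two distinct species and PRESERVED_SPECIMEN has one, regardless of how many raw rows each contains.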
By setting .keep_all = TRUE, we get more information on each record.
Due to limits on the APIs, this is not a perfect analogy for running
dplyr::distinct() on raw occurrences, but it does allow us to
generalise atlas_species() to use any taxonomic identifier. For example,
we might choose to show data by family instead of species:
galah_call() |>
identify("Coleoptera") |>
distinct(familyID, .keep_all = TRUE) |>
collect()
Using group_by() is also valid:
galah_call() |>
filter(year == 2024,
genus == "Crinia") |>
group_by(speciesID) |>
distinct(.keep_all = TRUE) |>
collapse()
In this case, collect() and
atlas_species() are synonymous, with the exception that the latter
does not require you to set the .keep_all argument to TRUE. So you
could instead use:
galah_call() |>
identify("Coleoptera") |>
distinct(familyID) |>
atlas_species()
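The comparison with dplyr::distinct() drawn above may be easier to follow with a local example. The following is a minimal sketch of what .keep_all does in plain dplyr, assuming dplyr is installed; the `records` tibble, its `recordID` column and all values are hypothetical, and the behaviour of the data_request method against a live atlas may differ as noted above.

```r
library(dplyr)

# Hypothetical toy data; identifiers are invented for illustration only.
records <- tibble(
  familyID = c("f1", "f1", "f2"),
  recordID = c("r1", "r2", "r3")
)

# Default (.keep_all = FALSE): only the columns named in distinct() remain.
family_only <- distinct(records, familyID)

# .keep_all = TRUE: every column is retained, taking the first row
# encountered for each distinct familyID.
first_per_family <- distinct(records, familyID, .keep_all = TRUE)

print(family_only)
print(first_per_family)
```

In this local case, `first_per_family` keeps `recordID` values "r1" and "r3", the first row seen for each family.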
## Not run:
galah_call() |>
distinct(basisOfRecord) |>
count() |>
collect()
## End(Not run)