distinct.data_request: Keep distinct/unique rows

View source: R/dplyr-distinct.R

distinct.data_request    R Documentation

Keep distinct/unique rows

Description

Keep only unique/distinct rows from a data frame. This is similar to unique.data.frame() but considerably faster. It is evaluated lazily.
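A minimal sketch of what lazy evaluation means here, assuming that distinct() returns the modified request object rather than data, as other galah verbs applied to galah_call() do. The name request is only illustrative.

```r
library(galah)

# Build the request; per the description, nothing is evaluated yet
request <- galah_call() |>
  distinct(basisOfRecord)

# Inspect the stored request object without running it
class(request)

# Evaluation happens only when the request is collected:
# result <- collect(request)
```

Keeping the request in a variable like this makes it possible to add further verbs, such as count(), before anything is sent.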

Usage

## S3 method for class 'data_request'
distinct(.data, ..., .keep_all = FALSE)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...

Variables to use when determining uniqueness. Unlike the dplyr implementation, this must be set for the function to do anything, and only a single variable is used.

.keep_all

If TRUE, keep all variables in .data. Defaults to FALSE.

Details

This function has several potential uses. In its default mode, it simply shows the unique values for a supplied field:

galah_call() |>
  distinct(basisOfRecord) |> 
  collect()

# A tibble: 9 × 1
  basisOfRecord      
  <chr>              
1 HUMAN_OBSERVATION  
2 PRESERVED_SPECIMEN 
3 OCCURRENCE         
4 MACHINE_OBSERVATION
5 OBSERVATION        
6 MATERIAL_SAMPLE    
7 LIVING_SPECIMEN    
8 FOSSIL_SPECIMEN    
9 MATERIAL_CITATION

This is the same result as you would get using show_values():

search_all(fields, "basisOfRecord") |> 
  show_values()

Using distinct() is somewhat more reliable, however, as it doesn't rely on searching the tibble returned by show_all(fields). It is also more efficient, particularly when caching is turned off. If the goal is to retrieve the number of distinct values of a field, use:

galah_call() |>
  distinct(basisOfRecord) |> 
  count() |>
  collect()

# A tibble: 1 × 1
  count
  <int>
1     9

When the variable passed to distinct() in the above example is speciesID, this is identical to calling:

atlas_counts(type = "species")
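The equivalence can be checked on a smaller query. The sketch below restricts both calls to one genus; the choice of "Perameles" and the object names are illustrative, and the two results should agree only to the extent that the equivalence described above also holds once a taxon filter is applied, which is an assumption here.

```r
library(galah)

# Count distinct species identifiers for one genus via distinct() + count()
via_distinct <- galah_call() |>
  identify("Perameles") |>
  distinct(speciesID) |>
  count() |>
  collect()

# The same taxon via atlas_counts(); expected to agree per the text above
via_counts <- galah_call() |>
  identify("Perameles") |>
  atlas_counts(type = "species")
```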

You can also add group_by() to count the distinct values of one variable within each level of a second variable:

galah_call() |>
  identify("Perameles") |>
  distinct(speciesID) |> 
  group_by(basisOfRecord) |>
  count() |>
  collect()

# A tibble: 8 × 2
  basisOfRecord       count
  <chr>               <int>
1 Human observation       7
2 Preserved specimen      9
3 Machine observation     2
4 Observation             3
5 Occurrence              3
6 Material Sample         4
7 Fossil specimen         1
8 Living specimen         1
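As a rough local analogy for the grouped query above, the sketch below counts distinct species within each basisOfRecord on a small in-memory tibble using plain dplyr. The occurrences data are made up for illustration, and the dplyr verbs differ from the galah pipeline (n_distinct() inside summarise() rather than distinct() followed by count()), so this shows what the result means rather than how galah computes it.

```r
library(dplyr)

# Hypothetical toy records; not real atlas data
occurrences <- tibble(
  speciesID = c("sp1", "sp1", "sp2", "sp3", "sp3"),
  basisOfRecord = c(
    "HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
    "HUMAN_OBSERVATION", "HUMAN_OBSERVATION", "HUMAN_OBSERVATION"
  )
)

# Number of distinct speciesID values across all records
n_species <- n_distinct(occurrences$speciesID)

# Number of distinct speciesID values within each basisOfRecord
species_per_basis <- occurrences |>
  group_by(basisOfRecord) |>
  summarise(count = n_distinct(speciesID))
```

A species recorded under more than one basisOfRecord is counted once in each group, so the per-group counts can sum to more than the overall number of species.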

By setting .keep_all = TRUE, we get more information on each record. Due to limits on the APIs, this is not a perfect analogue of running dplyr::distinct() on raw occurrences, but it does allow us to generalise atlas_species() to use any taxonomic identifier. For example, we might choose to show data by family instead of species:

galah_call() |>
  identify("Coleoptera") |>
  distinct(familyID, .keep_all = TRUE) |> 
  collect()
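One way to see what .keep_all adds is to run the same query with and without it and compare the returned columns. This is a sketch only: the genus "Perameles" is chosen here to keep the query small, the object names are illustrative, and which extra columns appear depends on the atlas being queried.

```r
library(galah)

# Identifiers only (the default, .keep_all = FALSE)
ids_only <- galah_call() |>
  identify("Perameles") |>
  distinct(speciesID) |>
  collect()

# Same query, keeping the additional information for each row
full <- galah_call() |>
  identify("Perameles") |>
  distinct(speciesID, .keep_all = TRUE) |>
  collect()

# Columns that are present only when .keep_all = TRUE
setdiff(colnames(full), colnames(ids_only))
```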

Using group_by() is also valid:

galah_call() |>
  filter(year == 2024,
         genus == "Crinia") |>
  group_by(speciesID) |>
  distinct(.keep_all = TRUE) |>
  collapse()

When .keep_all = TRUE, collect() and atlas_species() are synonymous, except that atlas_species() does not require you to set the .keep_all argument. So you could instead use:

galah_call() |>
  identify("Coleoptera") |>
  distinct(familyID) |> 
  atlas_species()

Examples

## Not run: 
galah_call() |>
  distinct(basisOfRecord) |>
  count() |>
  collect()

## End(Not run)

galah documentation built on Feb. 11, 2026, 9:11 a.m.