Matching personal names is a perennial difficulty in the study of art history, as researchers must deal not only with variant spellings in collections around the world (looking for "Rembrandt van Rijn"? Don't forget to check for mentions of "رامبرانت"!) but also variant spellings that have appeared in historical documents (e.g. "Rembrandt Hermanszoon van Rijn"). The Getty's Union List of Artist Names was expressly designed to catalog both modern-day spellings as well as historical variants and occurrences, providing museums an authority list to aid in describing their own collections. However, unless the ULAN has the exact spelling of an artist's name in their directory, finding candidate matches still involves a lot of manual work.

This package aims to partially automate the identification of matching candidates in the ULAN by fuzzy matching.

Local and Remote Methods

ulan_match takes a character vector of names, and returns a named list of data frames with the attributes of candidate matches.

method = "local" will search for results from a table of the ULAN's alternate names (available through the ulanrdata package), returning a named list with one data.frame per input name, each listing ULAN entities with alternate names that have a high character-level cosine similarity to the input names. When the input name matches one of the ULAN alternate names exactly, the local method will return only those matches, without running costly string similarity measurements. (Note that it is possible for two separate individuals, e.g. Vincent van Gogh (1820-1888) and Vincent van Gogh (1853-1890) to match the same variant name spelling.)

library(ulanr)
ulan_match(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "local")

An alternate method, method = "sparql", works by directly querying the Getty's live SPARQL endpoint, relying on their Lucene index to search for similar matches across the alternate and preferred names for all artists.

ulan_match(c("Rembrandt", "Vincent van Gogh"), method = "sparql")

This format plays nicely with dplyr::bind_rows(), whose .id argument will allow you to create a column of these list names in one unified data frame:

suppressPackageStartupMessages(library(dplyr))
ulan_match(c("Rembrandt", "Vincent van Gogh"), method = "sparql") %>%
  bind_rows(.id = "original_name")

Date restrictions

You may have more than just a name when searching for an artist - you may also know when they were alive. Use the early_year and late_year arguments to establish bounds for match candidates. When strictly_between = FALSE (the default), matches will be allowed when the input lifespan intersects with the lifespan defined by the ULAN. When strictly_between = TRUE, then ULAN matches' life dates must fall completely within the early_year:late_year range.

ulan_match("Rembrandt", early_year = 1600, late_year = 1800, strictly_between = FALSE, method = "sparql")

ulan_match("Rembrandt", early_year = 1600, late_year = 1800, strictly_between = TRUE, method = "sparql")

You may supply vectors to early_year and late_year of the same length as names, or you can alternately provide them with a single value that will be recycled for all queries.

cutoff_score and max_results

The data.frame returned for each name given to ulan_match contains a score column with the similarity score returned by either the cosine similarity metric used for the local method, or the Lucene index used by the sparql method. These scores are not directly comparable across methods. The cosine similarity score may range between 0 to 1, while the Lucene result score, in this particular query environment, tends to range between between 0 and 12 or more. (Exercise care when interpreting the Lucene result scores!)

If cutoff_score is set to NULL, then sane defaults are used based on the given method (0.95 for local and 3 for sparql) that will sift out many false positive matches. If not match is found above the cutoff score, ulan_match will return a data frame with 1 row of NA values.

Set cutoff_score to 0 to return results of any score.

You may also set the maximum number of results to be returned via max_results. This defaults to 5, though for method = "sparql" any number higher than 50 will be ignored, out of politeness towards the Getty's endpoint.

ulan_match(c("Vincent van Gogh", "qwerty"), cutoff_score = 0, max_results = 10, method = "local")
ulan_match(c("Vincent van Gogh", "qwerty"), cutoff_score = NULL, max_results = 10, method = "local")

ulan_id and ulan_data

These are utility wrapper functions that return only the top match of ulan_match. ulan_id returns a vector of ID numbers, while ulan_data returns a single data frame with all the columns from the regular results of ulan_match, along with a name column matching the original vector of names supplied by the user.

ulan_id(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "sparql")

ulan_data(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "sparql")


mdlincoln/ulanr documentation built on May 22, 2019, 4:16 p.m.