omit_duplicates: omit_duplicates

View source: R/omit_duplicates.R

Description

Remove duplicate works from a corpus generated by build_corpus.

Usage

omit_duplicates(corpus, strict)

Arguments

corpus

A dataframe representing a corpus of downloaded texts generated by build_corpus

strict

Should works be considered duplicates only if they share both the same author's last name and the same city (along with matching title, publication date, and volume number)?

Details

Because the Internet Archive's collection of texts includes many works more than once, the output of 'build_corpus' will likely contain duplicates. 'omit_duplicates' takes a fairly conservative approach to filtering them out. By default, the function treats two works as duplicates if the first ten words of their titles are identical, they have the same publication date and volume number, and they share either the same author's last name or the same city of publication (formatting inconsistencies are particularly common in these two pieces of metadata). Setting the 'strict' argument to 'TRUE' treats works as duplicates only if they share both the same author's last name and the same city of publication, in addition to the matching title, publication date, and volume number.
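
The matching rule described above can be illustrated with a minimal sketch in base R. This is not the package's implementation. The column names ('title', 'date', 'volume', 'author_last', 'city'), the helper 'first_words', and the function 'sketch_omit_duplicates' are all hypothetical, and the choice to keep the first occurrence of each duplicated work is an assumption, since the documentation does not say which copy is retained.

```r
# Hypothetical helper: reduce each title to its first n whitespace-separated words.
first_words <- function(x, n = 10) {
  vapply(strsplit(trimws(x), "\\s+"),
         function(w) paste(head(w, n), collapse = " "),
         character(1))
}

# Sketch of the rule in Details. Assumes hypothetical column names and
# no missing values; keeps the first occurrence of each duplicated work.
sketch_omit_duplicates <- function(corpus, strict = FALSE) {
  title_key <- first_words(corpus$title)
  keep <- rep(TRUE, nrow(corpus))
  for (i in seq_len(nrow(corpus))) {
    # Compare row i only against earlier rows that have been kept.
    earlier <- which(keep[seq_len(i - 1)])
    same_core <- title_key[earlier] == title_key[i] &
      corpus$date[earlier] == corpus$date[i] &
      corpus$volume[earlier] == corpus$volume[i]
    same_author <- corpus$author_last[earlier] == corpus$author_last[i]
    same_city <- corpus$city[earlier] == corpus$city[i]
    # strict: author AND city must match; default: author OR city.
    extra <- if (strict) same_author & same_city else same_author | same_city
    if (any(same_core & extra)) keep[i] <- FALSE
  }
  corpus[keep, , drop = FALSE]
}

# Toy data (invented for illustration): one work with inconsistent metadata.
corpus <- data.frame(
  title = rep("A History of Medicine", 3),
  date = c(1890, 1890, 1890),
  volume = c(1, 1, 1),
  author_last = c("Smith", "Smith", "Smyth"),
  city = c("London", "Londini", "London"),
  stringsAsFactors = FALSE
)

nrow(sketch_omit_duplicates(corpus))                 # 1
nrow(sketch_omit_duplicates(corpus, strict = TRUE))  # 3
```

In this toy data the default rule collapses all three rows into one, because each later row matches the first on either author or city. With 'strict = TRUE' none of the rows match on both fields, so all three are kept.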

Value

A dataframe containing the input corpus with duplicate works removed


mariolaespinosa/historicalnetworks documentation built on Feb. 9, 2022, 12:31 p.m.