README.md
In stephbuon/hansardr: Read the Hansard 19th-Century British Parliamentary Debates

hansardr makes it easy to access the parsed debates from The Hansard 19th-Century British Parliamentary Debates with Improved Speaker Names within the R environment.

This is a clean corpus of the 19th-century British Parliamentary Debates (1803-1909), also known as Hansard. It identifies debates whose records are missing from UK Parliament’s corpus, and it also offers a field for disambiguated speakers. We believe these improvements will enable researchers to analyze the Hansard debates, including speaker discourse, in a way that has not been accessible before.

For supplementary materials meant to support the analysis of the Hansard debates, including tokens and their raw counts, bigrams and their raw counts, special vocabulary, speaker metadata, and topics from LDA topic modeling, see our full data set hosted on the Harvard Dataverse.

Install from CRAN:

install.packages("hansardr")

Install from GitHub:

# install.packages("devtools")
library(devtools)
install_github("stephbuon/hansardr")

Now the package can be imported as usual:

library(hansardr)

hansardr ships with a sample data set of 10 rows per decade subset. To download the full corpus, run download_hansard(), which replaces the samples with the data for the entire century.

The Hansard corpus is subsetted by decade. Each decade has four types of data, labeled: "hansard," "debate_metadata," "speaker_metadata," and "file_metadata." In the following table, "YYYY" stands in for any given decade.

| Label | Description | Key |
| ------------- | ------------- | ------------- |
| hansard_YYYY | Hansard debate text | sentence_id |
| debate_metadata_YYYY | Hansard debate metadata such as speechdate and title. | sentence_id |
| speaker_metadata_YYYY | Original speaker name, disambiguated speaker name, and more. | sentence_id |
| file_metadata_YYYY | Corpus metadata such as IDs for speech, source file, column, and more. | sentence_id |
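Because every object follows the label_YYYY pattern, the object names for a given decade can be built programmatically. The following is a minimal sketch, assuming decades are written as four-digit years such as 1880; the `decade` value and the variable names are illustrative and not part of hansardr.

```r
# The four data types listed in the table above.
labels <- c("hansard", "debate_metadata", "speaker_metadata", "file_metadata")

# Illustrative decade; substitute any decade covered by the corpus.
decade <- 1880

# Build the object names for that decade, e.g. "hansard_1880".
object_names <- paste0(labels, "_", decade)
print(object_names)
```

Each resulting name could then be passed to data(), as in the usage steps that follow.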

We also provide keyword lists that were used in scholarly research.

| Label | Description |
| ------------- | ------------- |
| events | Manually selected list of events and their years |

Load hansardr.

library(hansardr)

Download the entire corpus. This will only need to be done once.

download_hansard()

Read files into the R environment.

data("hansard_1880")

data("debate_metadata_1880")

Constructing a larger data set from each subsection of the data is easy.

Tables can be joined on the sentence_id field, a unique ID assigned to each sentence of the Hansard debates.

library(dplyr)

combined_hansard_df_1800 <- left_join(hansard_1800, debate_metadata_1800, by = "sentence_id")
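To see what the join does without downloading the corpus, here is a small sketch on toy data frames. Only the sentence_id key is taken from this README; the `text` column name, the IDs, and the dates are placeholders invented for illustration, and the real tables may be laid out differently.

```r
library(dplyr)

# Toy stand-ins for hansard_1800 and debate_metadata_1800.
# Only sentence_id is confirmed by the README; all other names and values are placeholders.
toy_hansard <- data.frame(
  sentence_id = c("s1", "s2", "s3"),
  text = c("First sentence.", "Second sentence.", "Third sentence."),
  stringsAsFactors = FALSE
)
toy_debate_metadata <- data.frame(
  sentence_id = c("s1", "s2"),
  speechdate = c("1803-01-01", "1803-01-01"),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of the left table (toy_hansard);
# rows without a match in the right table get NA in the added columns.
toy_combined <- left_join(toy_hansard, toy_debate_metadata, by = "sentence_id")

print(toy_combined)
```

A left join is a reasonable default here because it never drops sentences from the text table, even when a sentence has no metadata row.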

Tables can be bound by row using rbind() from base R, or bind_rows() from the tidyverse.

hansard_df_1850_through_1860 <- rbind(hansard_1850, hansard_1860)

library(tidyverse)

hansard_df_1850_through_1860 <- bind_rows(hansard_1850, hansard_1860)
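When more than two decades are needed, listing every object by hand becomes tedious. One possible approach, sketched here on toy data frames rather than the real corpus, is to build the object names and bind them in a single call using base R. The `toy_hansard_*` objects and the `text` column are placeholders for illustration only.

```r
# Toy stand-ins for three decade subsets; the real objects come from hansardr.
toy_hansard_1850 <- data.frame(sentence_id = "a1", text = "Toy sentence A.", stringsAsFactors = FALSE)
toy_hansard_1860 <- data.frame(sentence_id = "b1", text = "Toy sentence B.", stringsAsFactors = FALSE)
toy_hansard_1870 <- data.frame(sentence_id = "c1", text = "Toy sentence C.", stringsAsFactors = FALSE)

# Build the object names for each decade of interest.
decades <- seq(1850, 1870, by = 10)
object_names <- paste0("toy_hansard_", decades)

# mget() looks the objects up by name in the current environment;
# do.call() then passes all of them to rbind() at once.
toy_combined <- do.call(rbind, mget(object_names, envir = environment()))

# rbind() on a named list produces compound row names; reset them.
rownames(toy_combined) <- NULL

print(toy_combined)
```

With the real data, the same pattern would apply after loading each decade with data() and swapping the "toy_hansard_" prefix for "hansard_".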

This is the first analysis-ready 19th-century Hansard corpus with disambiguated speaker names. As described in our research, we use mixed methods (algorithmic and qualitative) to disambiguate speaker names, and we arrive at a disambiguation rate of approximately 85%. If you find a bug while using our data set, we would appreciate you sharing it with us! You can open an issue on our hansard-speakers repository.

Buongiorno, Steph, 2021, hansardr. Available: https://github.com/stephbuon/hansardr.

stephbuon/hansardr documentation built on March 1, 2023, 6:42 p.m.