knitr::opts_chunk$set(echo = TRUE) # devtools::install_github("bergant/datamodelr") library(datamodelr)
This vignette describes the election violence database .
The election violence database contains information in four connected sections:
Event Report Coding: The main section of the database contains all the project coding of reports of election violence in nineteenth and early twentieth century newspapers. Most users will be interested only (or primarily) in this part of the database.
Source Material The second section of the database contains information about the source material and search strategies used by the project to find that source material.
Users The third section of the database contains data about users.
Clustering The fourth section of the database contains the data collected during the clustering process which grouped reports of the same event together.
The diagram below shows the overall tabular structure of each section of the database. The diagram shows the primary and foreign key columns in each table, so the figure can serve as a reference for joining tables within and between sections. However, on first read through it is probably enough to note the broad relationship between the four sections of the database, and that there can be multiple tables in each section.
Is this diagram too complex to go here? Could have a simple section level figure of the database here and put this diagram later (perhaps at the end).
maybe put both complex diagram and simple section level figure here? That way users can focus on whichever they need (probably simple in forst instance, but they know where complex it when coming back to it)
dm <- datamodelr::dm_read_yaml("evp_erd.yml") graph <- dm_create_graph(dm, view_type = "keys_only") dm_render_graph(graph)
The main part of the database contains all the project coding of reports of election violence in nineteenth and early twentieth century newspapers. Most users will be interested only (or primarily) in this part of the database. The codings were generated by expert coders (users) reading newspaper articles (documents) and recording information about reports of election violence events reported in those newspaper articles, together with further classification of the event reports (tags and attributes). This coding is recorded in four related tables:
The diagram below shows the fields in these four tables, and indicates how these four tables relate to each other.
graph <- dm_create_graph(dm, rankdir="RL", focus=list(tables = c("userdocumentallocation", "event_reports", "tag", "attribute"))) dm_render_graph(graph)
user_doc_id
Unique identifying number for each event report allocated to a specific user.
user_id
Unique user number - see users tables.
document_id
Unique number for each newspaper article - see sources tables.
allocation_date
Date at which the document was allocated to a user for coding.
allocation_type
Whether this allocation was for live coding purposes, or merely for training and testing of coding staff. Values of coding, training, testing.
allocated_by
The nature and method of allocations - usually regular_random_assignment, in which documents were assigned randomly to two or more coders, with a 10% double-coding overlap.
coding_complete
Whether a coder finished categorising and article and submitted it. 1 or 0.
article_type
What type of article the document? Coders could choose from the following:
- Report of Trial or Court Proceedings. Usually summaries, with excerpts of evidence and speeches from lawyers and judges.
- Report of Parliamentary Proceedings. Speeches made by MPs and Lords in the Houses of Parliament.
- Letter(s) to Editor.Article takes the form of letters e.g. start with dear Sir, and end with the name of the writer.
- Commentary, Editorial, Opinion-Piece. Generally marked by a more personal, opinionated tone, rather than a 'dispassionate' summary of events. Often make explicit moral judgements on events, with a clear political slant.
- Printed Statement by Candidate or Party. Usually start with the preamble ‘To the Electors of…’, end with the name of the candidate, and take the form similar to that of a letter to the Editor.
- Other. Any type of article that does not correspond to those listed above. NOTE: The majority of articles coded fell into this category.
geo_relevant
Does the article refer to event(s) in England and Wales? True or False.
time_relevant
Does the article refer to event(s) which took place between 1832 and 1914? True or False
electoral_nature
Does the article refer to events which occurred during an election? True or False.
violence_nature
Does this article refer to violent events? True or False.
electoralviolence_nature
Does the article contain information that may be of interest regarding the causes and consequences of electoral violence? True or False.
NOTE: All questions (geo, time, electoral, violence, electoralviolence) must be TRUE to merit the creation of event reports. Various combinations of TRUE and FALSE provide more precise knowledge of why an article chosen manually, or by our text classifier, was irrelevant (e.g. all TRUE, including electoralviolence, but FALSE for geo, usually indicates reports of Irish election violence).
legibility
Whether the coder could read and understand the text of the article, either by clean-enough OCR or PDF image. True or False.
comment_docinfo
Open-text qualitative comment by coder at doc info level, often indicating where there is uncertainty in the coding of other fields in the user_docs table, or the reason why an article is worth further qualitative investigation (see recommend_qualitative)
status
Whether a coder finished categorising and article and submitted it, or is in progress. New, Saved, or Complete.
recommend_qualitative
Qualitative judgement by the coder of whether, given the interests and aims of the project, the article contains information that may be of interest regarding the causes and consequences of electoral violence. True or False.
difficulty_ranking
Qualitative judgement by the coder of how difficult the article was to read, categorise, and analyse.
ideal_coding_comments
Not in use. Is this right?
score
Percentage matching score compared to double-codin of the same document by another user (where applicable - score of 0 where no double-coding exists).
last_updated
That last date at which the article was amended and submitted on the codingp platform.
relevant
Not sure, is it just 5 TRUE comments, or also not by-election, etc?
This table stores information about reports of election violence. Each row in this table is a record of a report of an election violence incident (found in a particular newspaper article by a particular user).
event_report_id
Unique identifying number for each event report.
user_doc_id
See user_docs table.
event_type
Field not in use.
environment
Where the event took place - Outdoors, Indoors, Both, Unclear.
event_start
The date at which an event began (or was estimated to begin). Where this was not known, the default value was the date of newspaper publication (see documents table). Often same as event_end. Formatted as yyyy-mm-dd (e.g. 1868-11-24).
event_end
The date at which an event ended (or was estimated to end). Where this was not known, the default value was the date of newspaper publication (see documents table). Often same as event_start. Formatted as yyyy-mm-dd (e.g. 1868-11-24).
comment_events
Open-text qualitative comment by coder, often indicating where there is uncertainty in the coding of other fields in the event reports table.
summary
Open-text qualitative summary of the election violence event. (e.g. 'riots after polling ended in Manchester. Candidate Lejeune attacked by the mob, escaped in carraige. The Riot Act read and military brought in to restore order.') Usually between 1-3 sentences, but occasionally up to a paragraph.
meeting
Whether the event took place at a political meeting, including at the nomination/hustings, private political meetings indoors, public and outdoor meetings held by candidates or parties etc. - True or False or Unclear.
election_point
At what point(s) in the election process did the event occur. Coders were able to select multiple options for events spanning multiple stages of the process.
- Before Polling. This is during the canvass, or any other period explicitly stated to be before the commencement of the actual election.
- Nomination or Hustings. This will be explicitly referred to as nomination or hustings. It is likely to be mentioned alongside the existence of crowds, a platform, long speeches by candidates, and questions from the audience.
- During Polling. This is when voters were actually able to cast their votes for candidates. May only be alluded to (such as ‘during the election’), or such as when polling is suspended due to violence – this indicates that violence and the polling were simultaneous.
- Declaration or Chairing. The point, often held in public and open-air settings, at which the winners and losers are announced. Occasionally, this is followed by the winning candidate being paraded through the town on a chair held aloft by supporters.
- After Declaration or Chairing. Any point after the end of the formal election process.
event_timeframe_quantifier
Related to start and end-dates. Coders could select 'during' for a verified date range, or 'before' if they were unsure about the date of occurrence (e.g. an unspecified day before the default date of newspaper puublication).
autodetected_cluster_id
See clustering tables. this right?
is_exact
Can't remember what this was
latitude
The latitude coordinates of the event location. For method of calculation, see located_from.
longitude
The latitude coordinates of the event location. For method of calculation, see located_from.
election_id
Not in use.
boundary_year
Constituency boundaries changed throughout the period covered by this database. Constituency shapefiles and g_names were soured from Vision of Britain, and this field reflects the set of shapefiles related to the event date. Boundary year options are 1832, 1862, 1868, 1870, 1885.
byelection
A value of 0 indicates that an event relates to a General Election contest. A value of 1 indicates that the event relates to a non-General Election contest. This includes parliamentary by-elections, BUT ALSO, despite the field label, other contests such as local council elections, elections for churchwardens etc.
constituency_g_name
The name of the constituency related to the event. Note that events which relate to a constituency do not always take place in that constituency, and will not be consistent with lag/long corrdinates. This applies in cases where (e.g.) a county polling riot takes place within a borough constituency boundary because a polling station was located there, or events crossed boundaries at the edge of constituencies. These g_names and related stapefiles are sourced from Vision of Britain.
county_g_name
The name of the county related to the event. Should always be consistent with constituency_g_name, though some post-1885 constituency boundaries may overlap multiple counties. These g_names and related shapefiles are sourced from Vision of Britain.
duration
Number of dates between event_start and event_end (usually 0)
election_name
The General Election during which an event occurred. This was automatically calculated using other fields, and should be ignored if a value is contained in this field, where byelection=1.
election_point_clean
Derived from election_point. Centres on polling - events are classed either as before polling, during polling, after polling, or unclear.
end_weekday
The day of the week corresponding to event_end.
geocluster
Can't remember what this was
geometry_type
Can't remember what this was
ignore_geounspecific
Can't remember what this was
located_from
The method by which lat/long coordinates of an event were calculated. This was done (e.g.) by matching to coder's town/village or constituency input [precise_tv and exact_constituency], or using the centrpoint of a county counstituency
start_weekday
The day of the week corresponding to event_start.
event_id
Not in use is this right?
tag_id
event_report_id
tag_table
tag_variable
tag_value
comment
contested
proximity_relative
attribute_id
tag_id
attribute
attribute_value
contested
The sources section of the database is designed to store information about the source documents which the reports of election violence are drawn from. Every event report is connected to a particular document (newspaper article). We located these documents by a search process. This search process involved: 1.Entering keywords into online newspaper archive search masks (overwhelmingly using the British Newspaper Archive) 1. Obtaining a set of results, that is the set of (links to) documents returned by the search. 1. Building the set of all newspaper articles returned by any search we conducted, with information about the article (these are the candidate documents - documents which may contain reference to election violence) 1. Selecting documents to code for instances of election violence.
The sources section of the database stores information about these documents, and connects these documents we found back through to the search process by which we found them. The information is stored in the following tables:
There is one additional table in the sources section of the database: archivesearchsummaryonly. This contains aggregate information (the number of results returned) about some searches. This information does not connect to the event reports. Similar aggregate information, like the number of results returned, about searches which do connect to the event reports can be calculated from the archivesearchresults table.
archivesearch_id
archive
search_text
archive_date_start
archive_date_end
search_batch_id
added_date_start
added_date_end
article_type
exact_phrase
exact_search
exclude_words
front_page
newspaper_title
publication_place
search_all_words
sort_by
tags
archivesearchresult_id
archivesearch_id
title
url
description
publication_title
publication_location
publication_date
word_count
type
candidate_document_id
url
description
publication_title
publication_location
type
status
page
publication_date
ocr
word_count
g_status
status_writer
document_id
candidate_document_id
source_id
doc_title
pdf_location
pdf_page_location
ocr
pdf_thumbnail_location
description
publication_date
publication_location
publication_title
type
url
word_count
archivesearchsummaryonly_id
archive
search_text
archive_date_start
archive_date_end
search_batch_id
search_all_words
exact_phrase
exclude_words
exact_search
publication_place
newspaper_title
added_date_start
added_date_end
article_type
front_page
tags
sort_by
timestamp
graph <- dm_create_graph(dm, focus=list(tables = c("documents", "candidatedocument", "archivesearchsummaryonly", "archivesearchresult", "archivesearch")), rankdir="RL") dm_render_graph(graph)
Gary/Patrick we need a table/field level description of each of these tables here
The users section of the database contains just one table which store information about database users.
user_id
password
last_login
is_superuser
username
first_name
last_name
is_staff
is_active
date_joined
graph <- dm_create_graph(dm, focus=list(tables = c("user"))) dm_render_graph(graph)
Gary/Patrick we need a field level description of this table here
Most election violence events were reported many times in the nineteenth and early twentieth century newspapers, we have for example coded more than 150 newspaper articles reporting the 1857 election riot in Kidderminster. The clustering section of the database contains the information on the relationship between different event reports, and particularly which event reports are considered to be about the same underlying event. The clustering was done after all the event reports were collected.
The idea of the clustering process is that every in scope^[Event reports are in scope if the coding is complete, the document is relevant, the event report relates to a general election, and if coding was undertaken in coding (as opposed to training or testing) modes. Out of scope event reports remain in the database but were not clustered.] event report should be a member of exactly one final cluster (combination cluster). The key information about any event report is its final (combination) cluster which is recorded in the combinedclusterentry table, which is linked back to event reports through the verifiedcluster and verifiedclusterentry table.
The database allows for this clustering process to be undertaken multiple times (different cluster attempts). Different cluster attempts are recorded in the clusterattempt_id column of the verified cluster table. The final cluster of an event report should therefore be unique within a particular cluster attempt.
The database contains stores information on multiple partial, and two major cluster attempts. Gary should we add information here about the clusterattempts with ids?. I think so The two main cluster attempts are 301:320, and 401:420 (covering 20 General Elections; 1832=301, 1835=302 etc.). All our our analysis of clustered data is based on attempts 401:420, which in addition to clustering on the platform, have undergone several manula post-clustering cleaning stages.
These clusters were created a sequential grouping and comparison process:
The clustering process was conducted within within election (i.e. event reports are only considered as possible matches to other event reports assigned to the same general election). Because of the sequential nature of the process (where information in tables was built up gradually by user entry so the information was often needed to processed by users in a partial state) some information is duplicated across tables. The authoritate route to the key information comes from: combinedclusterentry.combinedcluster_id, verifiedcluster.clusterattempt_id and verifiedclusterentry.event_report_id.
A full description of the clustering tables is found in the diagram below:
graph <- dm_create_graph(dm, focus=list(tables = c("userreallocationallocation", "userautodetectedclusterallocation", "verifiedcluster", "verifiedclusterentry", "usercombinationallocation", "combinedclusterentry"))) dm_render_graph(graph)
verified_cluster_id
user_alloc_id
reallocation_alloc_id
clusterattempt_id
latitude
longitude
needs_additional_checks
election_id
user_alloc_id
user_id
verifiedclusterentry_id
verified_cluster_id
event_report_id
best_description
uncertain
reallocation_alloc_id
event_report_id
user_id
clusterattempt_id
completed
last_updated
usercombinationallocation_id
verifiedcluster_id
user_id
combinedclusterentry_id
verifiedcluster_id
combinedcluster_id
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.