knitr::opts_chunk$set(echo = TRUE)
# devtools::install_github("bergant/datamodelr")
library(datamodelr)

This vignette describes the election violence database .

Overview of Database Structure

The election violence database contains information in four connected sections:

  1. Event Report Coding: The main section of the database contains all the project coding of reports of election violence in nineteenth and early twentieth century newspapers. Most users will be interested only (or primarily) in this part of the database.

  2. Source Material The second section of the database contains information about the source material and search strategies used by the project to find that source material.

  3. Users The third section of the database contains data about users.

  4. Clustering The fourth section of the database contains the data collected during the clustering process which grouped reports of the same event together.

The diagram below shows the overall tabular structure of each section of the database. The diagram shows the primary and foreign key columns in each table, so the figure can serve as a reference for joining tables within and between sections. However, on first read through it is probably enough to note the broad relationship between the four sections of the database, and that there can be multiple tables in each section.

Is this diagram too complex to go here? Could have a simple section level figure of the database here and put this diagram later (perhaps at the end).
maybe put both complex diagram and simple section level figure here? That way users can focus on whichever they need (probably simple in forst instance, but they know where complex it when coming back to it)

dm <- datamodelr::dm_read_yaml("evp_erd.yml")
graph <- dm_create_graph(dm, view_type = "keys_only")
dm_render_graph(graph)

Event Report Coding Tables

The main part of the database contains all the project coding of reports of election violence in nineteenth and early twentieth century newspapers. Most users will be interested only (or primarily) in this part of the database. The codings were generated by expert coders (users) reading newspaper articles (documents) and recording information about reports of election violence events reported in those newspaper articles, together with further classification of the event reports (tags and attributes). This coding is recorded in four related tables:

  1. user_docs This table contains information about the allocation of users to newspaper articles. Each row in this table records the particular combination of user and document (allowing that users were responsible for coding multiple documents and one document could be coded by multiple different users).
  2. event_reports This table stores information about reports of election violence. Each row in this table is a record of a report of an election violence incident (found in a particular newspaper article by a particular user). This is probably the table in the database which users will want to understand first. agree
  3. tags This table stores further information about reported events like the location(s) where it took place, the actors involved in the event, and the kinds of action which constituted the event. Multiple tags of every kind could be associated with one reported event (e.g. an event could be associated with multiple locations, multiple actors and multiple actions).
  4. attributes The tags and attributes tables work together to store classification information about events. Basic classification information is stored in the tags table. The attributes table supplements the tags table with further more detailed information, where this information is associated with particular tags. This two layer structure enables flexible storage of classification information at different levels of granularity and in different ways. For example, if a tag identifies that the event report involes a violent disturbance, attributes might identify whether that this violent disturbance, was described as a riot in the press, had the riot act read and what the crowd size involved with the disturbance was.

The diagram below shows the fields in these four tables, and indicates how these four tables relate to each other.

graph <- dm_create_graph(dm, rankdir="RL", focus=list(tables = c("userdocumentallocation", "event_reports", "tag", "attribute")))
dm_render_graph(graph)

user_docs (userdocumentallocation)

user_doc_id
Unique identifying number for each event report allocated to a specific user.

user_id
Unique user number - see users tables.

document_id
Unique number for each newspaper article - see sources tables.

allocation_date
Date at which the document was allocated to a user for coding.

allocation_type
Whether this allocation was for live coding purposes, or merely for training and testing of coding staff. Values of coding, training, testing.

allocated_by
The nature and method of allocations - usually regular_random_assignment, in which documents were assigned randomly to two or more coders, with a 10% double-coding overlap.

coding_complete
Whether a coder finished categorising and article and submitted it. 1 or 0.

article_type
What type of article the document? Coders could choose from the following:
- Report of Trial or Court Proceedings. Usually summaries, with excerpts of evidence and speeches from lawyers and judges. - Report of Parliamentary Proceedings. Speeches made by MPs and Lords in the Houses of Parliament.
- Letter(s) to Editor.Article takes the form of letters e.g. start with dear Sir, and end with the name of the writer.
- Commentary, Editorial, Opinion-Piece. Generally marked by a more personal, opinionated tone, rather than a 'dispassionate' summary of events. Often make explicit moral judgements on events, with a clear political slant.
- Printed Statement by Candidate or Party. Usually start with the preamble ‘To the Electors of…’, end with the name of the candidate, and take the form similar to that of a letter to the Editor.
- Other. Any type of article that does not correspond to those listed above. NOTE: The majority of articles coded fell into this category.

geo_relevant
Does the article refer to event(s) in England and Wales? True or False.

time_relevant
Does the article refer to event(s) which took place between 1832 and 1914? True or False

electoral_nature
Does the article refer to events which occurred during an election? True or False.

violence_nature
Does this article refer to violent events? True or False.

electoralviolence_nature
Does the article contain information that may be of interest regarding the causes and consequences of electoral violence? True or False.
NOTE: All questions (geo, time, electoral, violence, electoralviolence) must be TRUE to merit the creation of event reports. Various combinations of TRUE and FALSE provide more precise knowledge of why an article chosen manually, or by our text classifier, was irrelevant (e.g. all TRUE, including electoralviolence, but FALSE for geo, usually indicates reports of Irish election violence).

legibility
Whether the coder could read and understand the text of the article, either by clean-enough OCR or PDF image. True or False.

comment_docinfo
Open-text qualitative comment by coder at doc info level, often indicating where there is uncertainty in the coding of other fields in the user_docs table, or the reason why an article is worth further qualitative investigation (see recommend_qualitative)

status
Whether a coder finished categorising and article and submitted it, or is in progress. New, Saved, or Complete.

recommend_qualitative
Qualitative judgement by the coder of whether, given the interests and aims of the project, the article contains information that may be of interest regarding the causes and consequences of electoral violence. True or False.

difficulty_ranking
Qualitative judgement by the coder of how difficult the article was to read, categorise, and analyse.

ideal_coding_comments
Not in use. Is this right?

score
Percentage matching score compared to double-codin of the same document by another user (where applicable - score of 0 where no double-coding exists).

last_updated
That last date at which the article was amended and submitted on the codingp platform.

relevant
Not sure, is it just 5 TRUE comments, or also not by-election, etc?

event_reports

This table stores information about reports of election violence. Each row in this table is a record of a report of an election violence incident (found in a particular newspaper article by a particular user).

event_report_id
Unique identifying number for each event report.

user_doc_id
See user_docs table.

event_type
Field not in use.

environment
Where the event took place - Outdoors, Indoors, Both, Unclear.

event_start
The date at which an event began (or was estimated to begin). Where this was not known, the default value was the date of newspaper publication (see documents table). Often same as event_end. Formatted as yyyy-mm-dd (e.g. 1868-11-24).

event_end
The date at which an event ended (or was estimated to end). Where this was not known, the default value was the date of newspaper publication (see documents table). Often same as event_start. Formatted as yyyy-mm-dd (e.g. 1868-11-24).

comment_events
Open-text qualitative comment by coder, often indicating where there is uncertainty in the coding of other fields in the event reports table.

summary
Open-text qualitative summary of the election violence event. (e.g. 'riots after polling ended in Manchester. Candidate Lejeune attacked by the mob, escaped in carraige. The Riot Act read and military brought in to restore order.') Usually between 1-3 sentences, but occasionally up to a paragraph.

meeting
Whether the event took place at a political meeting, including at the nomination/hustings, private political meetings indoors, public and outdoor meetings held by candidates or parties etc. - True or False or Unclear.

election_point
At what point(s) in the election process did the event occur. Coders were able to select multiple options for events spanning multiple stages of the process.
- Before Polling. This is during the canvass, or any other period explicitly stated to be before the commencement of the actual election.
- Nomination or Hustings. This will be explicitly referred to as nomination or hustings. It is likely to be mentioned alongside the existence of crowds, a platform, long speeches by candidates, and questions from the audience.
- During Polling. This is when voters were actually able to cast their votes for candidates. May only be alluded to (such as ‘during the election’), or such as when polling is suspended due to violence – this indicates that violence and the polling were simultaneous.
- Declaration or Chairing. The point, often held in public and open-air settings, at which the winners and losers are announced. Occasionally, this is followed by the winning candidate being paraded through the town on a chair held aloft by supporters.
- After Declaration or Chairing. Any point after the end of the formal election process.

event_timeframe_quantifier
Related to start and end-dates. Coders could select 'during' for a verified date range, or 'before' if they were unsure about the date of occurrence (e.g. an unspecified day before the default date of newspaper puublication).

autodetected_cluster_id
See clustering tables. this right?

is_exact
Can't remember what this was

latitude
The latitude coordinates of the event location. For method of calculation, see located_from.

longitude
The latitude coordinates of the event location. For method of calculation, see located_from.

election_id
Not in use.

boundary_year
Constituency boundaries changed throughout the period covered by this database. Constituency shapefiles and g_names were soured from Vision of Britain, and this field reflects the set of shapefiles related to the event date. Boundary year options are 1832, 1862, 1868, 1870, 1885.

byelection
A value of 0 indicates that an event relates to a General Election contest. A value of 1 indicates that the event relates to a non-General Election contest. This includes parliamentary by-elections, BUT ALSO, despite the field label, other contests such as local council elections, elections for churchwardens etc.

constituency_g_name
The name of the constituency related to the event. Note that events which relate to a constituency do not always take place in that constituency, and will not be consistent with lag/long corrdinates. This applies in cases where (e.g.) a county polling riot takes place within a borough constituency boundary because a polling station was located there, or events crossed boundaries at the edge of constituencies. These g_names and related stapefiles are sourced from Vision of Britain.

county_g_name
The name of the county related to the event. Should always be consistent with constituency_g_name, though some post-1885 constituency boundaries may overlap multiple counties. These g_names and related shapefiles are sourced from Vision of Britain.

duration
Number of dates between event_start and event_end (usually 0)

election_name
The General Election during which an event occurred. This was automatically calculated using other fields, and should be ignored if a value is contained in this field, where byelection=1.

election_point_clean
Derived from election_point. Centres on polling - events are classed either as before polling, during polling, after polling, or unclear.

end_weekday
The day of the week corresponding to event_end.

geocluster
Can't remember what this was

geometry_type
Can't remember what this was

ignore_geounspecific
Can't remember what this was

located_from
The method by which lat/long coordinates of an event were calculated. This was done (e.g.) by matching to coder's town/village or constituency input [precise_tv and exact_constituency], or using the centrpoint of a county counstituency

start_weekday
The day of the week corresponding to event_start.

event_id
Not in use is this right?

tag

tag_id

event_report_id

tag_table

tag_variable

tag_value

comment

contested

proximity_relative

attribute

attribute_id

tag_id

attribute

attribute_value

contested

Sources Tables

The sources section of the database is designed to store information about the source documents which the reports of election violence are drawn from. Every event report is connected to a particular document (newspaper article). We located these documents by a search process. This search process involved: 1.Entering keywords into online newspaper archive search masks (overwhelmingly using the British Newspaper Archive) 1. Obtaining a set of results, that is the set of (links to) documents returned by the search. 1. Building the set of all newspaper articles returned by any search we conducted, with information about the article (these are the candidate documents - documents which may contain reference to election violence) 1. Selecting documents to code for instances of election violence.

The sources section of the database stores information about these documents, and connects these documents we found back through to the search process by which we found them. The information is stored in the following tables:

  1. archivesearches This table stores information about the keyword searches we conducted.
  2. archivesearchresults This table stores information about the results returned by the online archives in reponse to our keyword searches.
  3. candidatedocuments This table indexes and table stores information about archive search results (so that a candidate document may appear in one or more rows in archivesearchresults).
  4. documents This table stores information about the subset of candidate documents that were considered sufficiently likely to contain election violence incidents and were thus worth reading and coding. Every election violence report in the database is drawn from a document with an entry in this table.

There is one additional table in the sources section of the database: archivesearchsummaryonly. This contains aggregate information (the number of results returned) about some searches. This information does not connect to the event reports. Similar aggregate information, like the number of results returned, about searches which do connect to the event reports can be calculated from the archivesearchresults table.

archivesearch

archivesearch_id

archive

search_text

archive_date_start

archive_date_end

search_batch_id

added_date_start

added_date_end

article_type

exact_phrase

exact_search

exclude_words

front_page

newspaper_title

publication_place

search_all_words

sort_by

tags

archivesearchresult

archivesearchresult_id

archivesearch_id

title

url

description

publication_title

publication_location

publication_date

word_count

type

candidatedocument

candidate_document_id

url

description

publication_title

publication_location

type

status

page

publication_date

ocr

word_count

g_status

status_writer

documents

document_id

candidate_document_id

source_id

doc_title

pdf_location

pdf_page_location

ocr

pdf_thumbnail_location

description

publication_date

publication_location

publication_title

type

url

word_count

archivesearchsummaryonly

archivesearchsummaryonly_id

archive

search_text

archive_date_start

archive_date_end

search_batch_id

search_all_words

exact_phrase

exclude_words

exact_search

publication_place

newspaper_title

added_date_start

added_date_end

article_type

front_page

tags

sort_by

timestamp

graph <- dm_create_graph(dm, focus=list(tables = c("documents", "candidatedocument", "archivesearchsummaryonly", "archivesearchresult", "archivesearch")), rankdir="RL")
dm_render_graph(graph)

Gary/Patrick we need a table/field level description of each of these tables here

Users Tables

The users section of the database contains just one table which store information about database users.

user_id

password

last_login

is_superuser

username

first_name

last_name

email

is_staff

is_active

date_joined

graph <- dm_create_graph(dm, focus=list(tables = c("user")))
dm_render_graph(graph)

Gary/Patrick we need a field level description of this table here

Clustering Tables

Most election violence events were reported many times in the nineteenth and early twentieth century newspapers, we have for example coded more than 150 newspaper articles reporting the 1857 election riot in Kidderminster. The clustering section of the database contains the information on the relationship between different event reports, and particularly which event reports are considered to be about the same underlying event. The clustering was done after all the event reports were collected.

The idea of the clustering process is that every in scope^[Event reports are in scope if the coding is complete, the document is relevant, the event report relates to a general election, and if coding was undertaken in coding (as opposed to training or testing) modes. Out of scope event reports remain in the database but were not clustered.] event report should be a member of exactly one final cluster (combination cluster). The key information about any event report is its final (combination) cluster which is recorded in the combinedclusterentry table, which is linked back to event reports through the verifiedcluster and verifiedclusterentry table.

The database allows for this clustering process to be undertaken multiple times (different cluster attempts). Different cluster attempts are recorded in the clusterattempt_id column of the verified cluster table. The final cluster of an event report should therefore be unique within a particular cluster attempt.

The database contains stores information on multiple partial, and two major cluster attempts. Gary should we add information here about the clusterattempts with ids?. I think so The two main cluster attempts are 301:320, and 401:420 (covering 20 General Elections; 1832=301, 1835=302 etc.). All our our analysis of clustered data is based on attempts 401:420, which in addition to clustering on the platform, have undergone several manula post-clustering cleaning stages.

These clusters were created a sequential grouping and comparison process:

  1. Autoclustering: Event reports with very similar features (e.g. similar locations in the same election) were put into clusters (autoclusters) by an automated algorithm
  2. Verification Coders compare all event reports in an autocluster, splitting as necessary to form a 'verified cluster'.
  3. Reallocation: 'Orphan' event reports which were not part of a larger group at the autocluster stage were compared with all of these 'verified clusters', and reallocated to an existing or new verified cluster.
  4. Combination Verified clusters were compared to each other, and where verified clusters were determined to refere to the same underlying event, verified clusters were combined into larger final or 'combination clusters'.

The clustering process was conducted within within election (i.e. event reports are only considered as possible matches to other event reports assigned to the same general election). Because of the sequential nature of the process (where information in tables was built up gradually by user entry so the information was often needed to processed by users in a partial state) some information is duplicated across tables. The authoritate route to the key information comes from: combinedclusterentry.combinedcluster_id, verifiedcluster.clusterattempt_id and verifiedclusterentry.event_report_id.

A full description of the clustering tables is found in the diagram below:

graph <- dm_create_graph(dm, focus=list(tables = c("userreallocationallocation", "userautodetectedclusterallocation", "verifiedcluster", "verifiedclusterentry", "usercombinationallocation", "combinedclusterentry")))
dm_render_graph(graph)

verifiedcluster

verified_cluster_id

user_alloc_id

reallocation_alloc_id

clusterattempt_id

latitude

longitude

needs_additional_checks

election_id

userautodetectedclusterallocation

user_alloc_id

user_id

verifiedclusterentry

verifiedclusterentry_id

verified_cluster_id

event_report_id

best_description

uncertain

userreallocationallocation

reallocation_alloc_id

event_report_id

user_id

clusterattempt_id

completed

last_updated

usercombinationallocation

usercombinationallocation_id

verifiedcluster_id

user_id

combinedclusterentry

combinedclusterentry_id

verifiedcluster_id

combinedcluster_id



gidonc/durhamevp documentation built on April 8, 2022, 10:31 a.m.