In petermeissner/idep: Helper functions and procedures for the IDEP Project

library(idep)
library(dplyr)
library(stringr)
library(magrittr)
load("Z:/Gesch\u00e4ftsordnungen/database/aggregats/isor.Rdata")

# version check 
data_path  <- "z:/gesch\u00e4ftsordnungen/database/extracts/"
db_on_disk_version <- 
  str_extract(list.files(data_path), "\\d+\\.\\d+") %>% 
  as.numeric()  %>% 
  max(na.rm=TRUE)
db_on_disk_version

set.seed(7787)
options("width" = 110)
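The version check above parses a version number out of each extract file name and keeps the largest one. The following is a minimal sketch of that step in base R; the file names are hypothetical, since the real directory listing is not shown. It also illustrates one caveat: converting to numeric ranks "1.12" below "1.3". If the versions are meant as major.minor pairs (an assumption), `numeric_version()` compares them component-wise.

```r
# Hypothetical extract file names (assumption; the real listing is not shown).
files <- c("extract_0.9.Rdata", "extract_1.12.Rdata",
           "extract_1.3.Rdata", "README.txt")

# Keep the first "<digits>.<digits>" token of each name; names without a
# match are dropped by regmatches().
versions <- regmatches(files, regexpr("[0-9]+\\.[0-9]+", files))

# Numeric comparison, as in the version check above: 1.12 < 1.3.
latest_numeric <- max(as.numeric(versions), na.rm = TRUE)

# Component-wise comparison: minor version 12 > minor version 3.
latest_semantic <- max(numeric_version(versions))

latest_numeric
latest_semantic
```

Which of the two results is the intended one depends on how the extract files are actually numbered.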

The ISOR dataset (version `r db_on_disk_version`)

The data-set is based on the IDEP data - data on Standing Orders versions (texts), data on Standing Orders Text (textlines) and data on Standing Orders change between versions (linelinkage). Information from all three sources is aggregated on Standing Orders version level - i.e. each version has its own line in the data-set. This aggregation allows for studying what happened - in the aggregate - at each reform of Standing Orders.

The data set incorporates aggregated data for `r dim(isor)[1] - length(unique(isor$t_country))` reforms in `r length(unique(substring(isor$t_country,1,3)))` countries and consists of `r dim(isor)[2]` variables.
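The three figures in the sentence above are plain computations on the data frame. The sketch below repeats them on a toy stand-in for `isor`; all values are hypothetical. The number of reforms is the number of rows minus the number of distinct `t_country` values, presumably because the first version in each series is a starting point rather than a reform (an assumption, not stated in the source).

```r
# Toy stand-in for isor; ids and country codes are hypothetical.
toy <- data.frame(
  t_id      = c("AAA_1", "AAA_2", "AAA_3", "BBB_1", "BBB_2"),
  t_country = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  stringsAsFactors = FALSE
)

# Rows minus distinct t_country values: 5 versions - 2 series = 3 reforms.
n_reforms <- nrow(toy) - length(unique(toy$t_country))

# Countries are identified by the first three characters of t_country.
n_countries <- length(unique(substring(toy$t_country, 1, 3)))

# Number of variables (columns).
n_variables <- ncol(toy)

c(reforms = n_reforms, countries = n_countries, variables = n_variables)
```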

Example:

isor  %>% 
  select(t_id, wds_clean_rel, wds_chg, pro_min, pro_maj, wds_corp_top_1, pro_minmaj_qual)

Shortcuts

SP is short for subparagraph and is in most cases identical to a line or text line. During data gathering, sub-paragraphs were transferred (from PDFs, Word documents and the like) to simple text files in which each sub-paragraph occupies exactly one line.

SO is short for Standing Orders (rules of procedure) and usually refers to one particular version of Standing Orders in force in a particular country from a particular time point on.

General variable name shortcuts

wds is short for words, like the number of words within a SP.

lns is short for lines, an equivalent of SP.

rel is short for relevant, i.e. whether a SP contains relevant information or can be dismissed as a headline, blank line, comment, appendix or some other irrelevant part of the text (the opposite usually is all).

raw versus clean denote whether the SP was used as found in the official documents or was cleaned up. Cleaned SPs have been stripped of their enumeration (only at the start of the SP). Pure changes of enumeration at the start of an SP are therefore neither considered changes nor counted as relevant content.
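To illustrate why cleaning matters for change detection, here is a minimal sketch. The helper `strip_enum()`, its regular expression and the example sentences are all hypothetical; the actual cleaning rules used for the data are not documented here. The point is only that two SPs differing solely in their leading enumeration become identical after cleaning, while enumerations inside the text are left alone.

```r
# Hypothetical cleaning helper (assumption, not the project's actual rule):
# drop an enumeration such as "(1) ", "2. " or "a) " at the start of an SP.
strip_enum <- function(sp) {
  sub("^[[:space:]]*\\(?([0-9]+|[A-Za-z])[.)][[:space:]]+", "", sp)
}

# Hypothetical raw SPs from two consecutive versions.
raw_old <- "(1) The President opens the sitting."
raw_new <- "(2) The President opens the sitting."

raw_old == raw_new                          # raw text differs
strip_enum(raw_old) == strip_enum(raw_new)  # cleaned text is identical

# Enumerations that are not at the start of the SP are kept.
strip_enum("See paragraph (2) below.")
```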

ins, del, mdf, non: denote the type of change that occurred for a particular SP - insertion, deletion, modification and no-change. The shorthand chg is for change and refers to all three forms that change can take (ins, del and mdf).
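The relation between the four change types and chg can be sketched as follows. The line-level table `ll` and its column names are hypothetical; the sketch only shows how per-SP change types could be counted per version and how chg sums the three kinds of actual change while leaving out non.

```r
# Hypothetical line-level change records: one row per SP and version.
ll <- data.frame(
  t_id = c("AAA_2", "AAA_2", "AAA_2", "AAA_2", "AAA_3", "AAA_3"),
  type = c("ins", "non", "mdf", "del", "non", "mdf"),
  stringsAsFactors = FALSE
)

# Fix the level order so every change type gets a column, even if it
# never occurs in a given version.
ll$type <- factor(ll$type, levels = c("ins", "del", "mdf", "non"))

# One row per version, one column per change type.
counts <- as.data.frame.matrix(table(ll$t_id, ll$type))

# chg covers insertions, deletions and modifications, but not non.
counts$chg <- counts$ins + counts$del + counts$mdf

counts
```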

pro_... indicate minority/majority codings, e.g. pro_min indicates how many SPs were changed in a way considered minority friendly.

Shortcuts indicating their raw data sets

db_ and int_ denote variables that have to do with the database that serves as basis for all data-sets and are most likely not relevant to the user.

t_ information on text or Standing Orders version level, e.g. the date when a particular standing order was accepted, the country it belongs to or the total number of words of the version.

tl_ information on text lines aka SPs, e.g. whether or not a SP is relevant, how many words it contains, its corpus code.

ll_ information on what happened to SPs from one version to the next - e.g. whether a SP was changed or deleted.

Other shortcuts, or variable names lacking the shortcuts mentioned above, indicate that those variables have been computed from other variables and most likely result from some form of aggregation. See below for more detailed descriptions.
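The prefix convention makes it easy to group variables by their origin. The sketch below classifies the column names of a toy data frame; `tl_wds`, `ll_del` and `db_ver` are hypothetical names chosen only to show each prefix, and the label "derived" for unprefixed names is likewise an illustration.

```r
# Toy data frame; tl_wds, ll_del and db_ver are hypothetical variable names.
toy <- data.frame(
  t_id = "AAA_1", t_country = "AAA", tl_wds = 100L,
  ll_del = 0L, db_ver = 1, wds_chg = 12L,
  stringsAsFactors = FALSE
)

# Names starting with one of the raw-data prefixes keep that prefix as their
# origin; all other names are treated as computed ("derived") variables.
pattern <- "^(db|int|t|tl|ll)_.*$"
origin  <- ifelse(grepl(pattern, names(toy)),
                  sub(pattern, "\\1", names(toy)),
                  "derived")

# Variable names grouped by origin.
split(names(toy), origin)
```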

Citing the Data

Publications using this dataset should acknowledge in writing that the information comes from:

Sieberer, Ulrich; Meißner, Peter; Keh, Julia; Müller, Wolfgang C. (2015): ISOR - IDEP Standing Orders Reforms Dataset.

Tsebelis, George (2002): Veto Players. How Political Institutions Work. Princeton UP

References used in the Codebook

ERD (2014): European Representative Democracy (ERD) Release 3.0 February 12, 2014 Codebook for ERD - e.

General Notes and Remarks

Change measure interpretation

SO lengths are given as SP counts (lns) and word counts (wds), and are further subdivided by relevance (rel) and by whether the word counts are based on raw or cleaned text. This variety of measures does not exist for changes between SO versions. Since all coding and analysis was carried out on relevant and cleaned SPs only, all change measures refer to SPs that were relevant and cleaned of enumerations.

source("C:/Dropbox/RPackages/idep/inst/tasks/aggregate_data/variable_description_isor.r")
desc_missing <- any( !(   names(isor) %in% var_desc_isor$name ))
desc_tomuch  <- any( !( var_desc_isor$name %in% names(isor)   ))
anymiss <- desc_tomuch | desc_missing

if(desc_missing){
  cat("# Missing variable descriptions")
}else{
  cat("")
}

names(isor)[!(names(isor) %in% var_desc_isor$name)]

if(desc_tomuch){
  cat("# Surplus variable descriptions")
}else{
  cat("")
}

var_desc_isor$name[!(var_desc_isor$name %in% names(isor))]
var_desc_isor <- var_desc_isor[(var_desc_isor$name %in% names(isor)),]
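The consistency check above compares the variable names in the data with the names in the description table, in both directions. The same idea can be sketched with `setdiff()` on two toy name vectors; the vectors below are hypothetical stand-ins for `names(isor)` and `var_desc_isor$name`.

```r
# Hypothetical stand-ins for names(isor) and var_desc_isor$name.
data_names <- c("t_id", "t_country", "wds_chg")
desc_names <- c("t_id", "wds_chg", "pro_min")

# Variables present in the data but lacking a description.
missing_desc <- setdiff(data_names, desc_names)

# Descriptions that refer to variables not present in the data.
surplus_desc <- setdiff(desc_names, data_names)

# Descriptions restricted to variables that actually exist.
kept_desc <- intersect(desc_names, data_names)

list(missing = missing_desc, surplus = surplus_desc, kept = kept_desc)
```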

Variable descriptions

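sumdummy <- function(x)
{
  # sums are not meaningful for text or dates, so return a placeholder instead
  if (is.character(x) || inherits(x, "Date")) {
    "-"
  } else {
    # "\u00A0" (no-break space) as thousands separator, matching the padding in sumvar()
    format(sum(x, na.rm=TRUE), scientific = FALSE, big.mark = "\u00A0")
  }
}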
sumvar <- function(x){
  cat("**`class    :`** `", 
      str_pad(class(x)               , 
              12, "left", "\u00A0") ,"`\\\n<br>")
  cat("**`unique   :`** `", 
      str_pad(length(unique(x))      , 
              12, "left", "\u00A0") ,"`\\\n<br>")
  cat("**`NAs      :`** `", 
      str_pad(sum(is.na(x))          , 
              12, "left", "\u00A0") ,"`\\\n<br>")
  cat("**`not-NA   :`** `", 
      str_pad(sum(!is.na(x))         , 
              12, "left", "\u00A0") ,"`\\\n<br>")
  cat("**`not-0-NA :`** `", 
      str_pad(sum(!is.na(x) & x!=0 ) , 
              12, "left", "\u00A0") ,"`\\\n<br>")
  cat("**`sum      :`** `", 
      str_pad(sumdummy(x)            , 
              12, "left", "\u00A0") ,"`\\\n<br>")
  cat(
      "**`range    :`** `[", 
    as.character(range(x, na.rm=TRUE)[1]), 
    "] ... [",
    as.character(range(x, na.rm=TRUE)[2]),
    "]",
    "`\\\n<br>"
  )
  # draw at most 10 example values; sampling more than length(x) would be an error
  ex <- x[sample.int(length(x), min(10, length(x)))]
  cat(
      "**`examples :`** `", 
    substring(paste0("[", paste(ex, collapse="], ["), "]"), 1, 80),
    "...",
    "`\\\n<br>"
  )

}

for( g in sort(unique(var_desc_isor$group)) ){
  gtitle <- str_replace(g, "^\\d* ", "")
  cat(paste0("\n## ", gtitle, "\n\n"))
  dat <- var_desc_isor  %>% filter(group==g)
  for( i in seq_len(dim(dat)[1]) ){
    cat(
      "\n\n**", dat$name[i], "** (", dat$from[i] ,")", "\n\n",
      dat$desc[i], "\n\n",
      sep=""
    )
    x <- as.data.frame(isor, stringsAsFactors=F)[ , dat$name[i] ]
    sumvar(x)
    cat("\n\n<p>&nbsp;</p>")
  }
}