
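# Output format pandoc is rendering to (e.g. "latex" or "html"); NULL outside of knitting
doc_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
# Fall back to a non-LaTeX value so the doc_type == "latex" checks do not fail on NULL
if (is.null(doc_type)) doc_type <- "html"
doc_type
if(doc_type=="latex") cat("\\newpage")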

Introduction

To collect data on reforms of Standing Orders, many steps had to be taken and many hands had to help. The basic idea was that two successive versions of the same text can be compared by putting them side by side and going through each (sub-)paragraph.

library(diffr)
old <- stringr::str_split(
"Three kind mice, see how they run!
They all ran after the farmer's wife,
Who cut off their tails with the carving knife,
Did you ever see such a thing in your life?
As three blind mice.
End", "\n")
new <- 
stringr::str_split(
"Three blind mice, see how they run!
They all ran after the farmer's wife,
they took out some cheese,
and they cut her a slice,
Did you ever see such a sight in your life 
as three kind mice?", "\n")
res <- diffr(old, new, dist="bow")
res_df <- 
  data.frame( 
    lnr1  = res$alignment_df$lnr1, 
    old = paste(substring(res$text1_orig[res$alignment_df$lnr1],1,27),"...") ,
    lnr2  = res$alignment_df$lnr2, 
    new = paste(substring(res$text2_orig[res$alignment_df$lnr2],1,27),"...") ,
    bowdist  = res$alignment_df$distance,
    type  = res$alignment_df$type, 
    stringsAsFactors = FALSE
  )
res_df[is.na(res_df)] <- ""
res_df[res_df=="NA ..."] <- ""
knitr::kable(
  data.frame(
    lnr1  = 1:6, 
    old   = unlist(old),
    lnr2  = 1:6,
    new   = unlist(new)
  ), 
  align=c("r","l","r","l")
)

Some parts might have changed, others might not have changed but were put at different locations. Those parts that have been changed might have been deleted, modified or inserted.

knitr::kable(res_df,align=c("r","l","r","l","r","r"))

To gather changes in that manner, the first task is to acquire all the documents that describe the status or evolution of a particular set of Standing Orders. That step incorporated finding contact persons within the parliaments and checking for completeness and consistency of the provided 'historical' documents.

While intuitively one might think about Standing Orders as explicit documents that are fully written out, most of the time this is only a small part of the story. Although so-called consolidated versions exist, one usually needs a consolidated version plus all the amendments made to it over time (short, technical instructions on how to transform the Standing Orders in place into a new set of Standing Orders) to know which set of rules was in force at a certain point in time. To apply the basic idea, all amendments had to be transformed into consolidated versions.

Documents were provided in differing forms and in differing quality. There might be sheets of paper, books, Word-documents, machine-readable PDFs or scans. All these types were first transformed into Word-documents and later freed of transformation errors and artifacts in a cleaning step.

After cleaning and consolidation, documents were restructured in such a way that each sub-paragraph corresponded to one line in a plain text file. Furthermore, lines without relevant content such as headlines or notes were marked by #§#. The restructuring made it easy for the documents to be read in by the coding programs used in the following steps.
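The one-line-per-sub-paragraph layout with #§# markers can be illustrated with a minimal base R sketch. The example lines and the object names (so_lines, relevant) are made up for illustration; in practice the lines would come from readLines() on a cleaned TXT file, and the project's own programs may handle this step differently.

```r
# Hypothetical example lines standing in for a cleaned plain text file
so_lines <- c(
  "#§# Standing Orders of the Example Parliament",
  "#§# Chapter 1",
  "The President opens the sitting.",
  "Members speak from the rostrum.",
  "#§# Chapter 2",
  "Votes are taken by show of hands."
)

marker <- "#§#"

# TRUE for lines flagged as non-relevant (headlines, notes, ...)
is_marked <- startsWith(so_lines, marker)

# Keep the original line numbers so later steps can refer back to the file
relevant <- data.frame(
  line_number = which(!is_marked),
  text        = so_lines[!is_marked],
  stringsAsFactors = FALSE
)
relevant
```

Keeping the original line numbers next to the text means that links and codes recorded later can still be traced back to the position in the file.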

For comparing Standing Orders effectively we made use of two types of programs. First, there are programs specialized in presenting the comparison of documents to humans. Second, there are programs that are less accessible to humans but more standardized and therefore better suited to serve as helpers for computer programs. While we found good companions in the first category (e.g. UltraCompare, the Notepad++ Compare Plugin, DiffDoc, WinMerge; see https://en.wikipedia.org/wiki/Comparison_of_file_comparison_tools for a list), we did not find any tool in the second category that suited our needs, i.e. indicating line modifications and measuring differences.

Therefore we wrote our own software that helped with comparing texts, assigning change types, and measuring differences. Three programs were written: the first for comparing documents, gathering links between sub-paragraphs from one version to the other, assigning change types, and measuring change; the second for coding changes between documents as minority-friendly or majority-friendly; the third for coding sub-paragraphs into categories capturing the type of regulation.

The data gathered with the help of these programs were then merged into one database with three tables: meta information on the Standing Orders (texts), the text of the Standing Orders and accompanying data (textlines), and how sub-paragraphs from one version are linked to those of another version (textlines). Thereafter, the information was checked for errors. After eliminating all errors, the raw information from the database was aggregated into various formats.

if(doc_type=="latex") cat("\\newpage")

Document Transformation

The original Standing Orders documents that we gathered were often books or came as scans in PDF format. To work with them further, the text had to be brought into a machine-readable format. The first digital format of the Standing Orders text was Word. Although the texts were later broken down further into plain text, Word is a good intermediate format: everyone is familiar with it, and it can emulate the layout of the original document, which eases comparisons between the original and the digital version.

if(doc_type=="latex"){
  cat(
"\\begin{center}
\\includegraphics[height=0.7\\textheight]{fig/fig1.png}
\\end{center}"
      )
}else{
  cat("![1952 SO Germany word-document](fig/fig1.png)")
}

Figure 1: 1952 German Bundestag Standing Orders Word-document

if(doc_type=="latex") cat("\\newpage")

Document Consolidation

In order to construct a database consisting of all parliamentary Standing Orders that were in force at a specific point in time, consolidated versions of the Standing Orders were needed. Consolidated versions are complete versions in which all changes that had occurred up to a specific date are included in the body of the text.

However, national parliaments often provide only a few consolidated versions. Changes to the Standing Orders are mostly published as amendments, and only once in a while is a full version of the current text issued. As a consequence, consolidated versions of the Standing Orders for every date of change had to be constructed by manually inserting the changes into the previous complete version.

if(doc_type=="latex"){
    cat(
"\\begin{center}
\\includegraphics[height=0.7\\textheight]{fig/fig1a.png}
\\end{center}"
      )
}else{
  cat("![List of amendments to the 1950 SO UK, page 497](fig/fig1a.png)")
}

Figure 2: List of amendments to the 1950 SO UK, page 497

if(doc_type=="latex") cat("\\newpage")

Document Cleaning

The aim of this procedure was to identify and correct errors that had occurred while converting PDF-documents into Word-documents. Another purpose was to put the cleaned versions of the parliamentary Standing Orders into a standardized format while maintaining the original structure of paragraphs. First of all, the oldest consolidated version of the Standing Orders was read through completely and corrected manually. Typos, unnecessary line breaks, characters not belonging to the text, and everything going beyond the actual text of the Standing Orders were removed.

Next, a header was inserted containing information about the version, such as the dates of acceptance, promulgation and enactment of that version of the Standing Orders. The version cleaned in this way served as our reference version. In the following step, the subsequent consolidated version was compared to the reference version using the software DiffDoc (at a later stage of the cleaning process, UltraCompare was used instead). The software made it easy to identify both the parts that were identical in the two versions, which contained only a few mistakes to be corrected, and the parts that had been changed. The latter were read completely and handled like the first version. Once the second version had been cleaned, it served as the new reference version for the subsequent consolidated version of the Standing Orders.

Alongside cleaning the text, it was made sure that each sub-paragraph, headline or other structuring element of the Standing Orders was given a single line. Elements were separated by line breaks; no element was allowed to span more than one line. These steps were repeated for all consolidated versions. Throughout the procedure, the PDF versions of the Standing Orders were consulted in case of uncertainty regarding cleaning decisions. In a last step, the lines (representing text elements) without relevant content (e.g. headlines) were marked by adding a special string ('#§#') at the start of the line so that the computer could automatically dismiss these lines later on.

if(doc_type=="latex"){
    cat(
"\\begin{center}
\\includegraphics[width=\\textwidth]{fig/fig2.png}
\\end{center}"
      )
}else{
  cat("![1952 SO Germany plain text version](fig/fig2.png)")
}

Figure 3: 1952 SO Germany plain-text version

if(doc_type=="latex"){
    cat(
"\\begin{center}
\\includegraphics[width=\\textwidth]{fig/fig3a.png}
\\end{center}"
      )
}else{
  cat("![Comparing Standing Orders versions with Ultra Compare](fig/fig3a.png)")
}

Figure 4: Comparing Standing Orders versions with Ultra Compare

if(doc_type=="latex") cat("\\newpage")

Linkage

Description

Having generated complete, cleaned versions of the Standing Orders, the next step was to prepare the data so that the content of the Standing Orders could be coded efficiently and so that it was known what had or had not changed from one version to the next. For this purpose, changes in the Standing Orders between versions were linked. This means that for each line of text (each relevant sub-paragraph) it was recorded whether the line was deleted in the subsequent version, inserted in the current version, changed, or simply stayed the same. The coding was done semi-automatically. First, an algorithm developed by the project team and implemented in R handled all non-relevant lines as well as those that were not changed. Thereafter, human coders went through all remaining text lines of two subsequent versions to add linkage information to the data set. Another program implemented in R supported the coders by making sure that all lines were coded and that the information was recorded correctly; alongside the text to be linked, it also suggested possible matching lines similar to the one under consideration. As the linked files constituted the basis for later analyses and coding, it was crucial to differentiate between minor reformulations of paragraphs (e.g. mere orthographic reforms) and actual changes. In case of doubt the supervisors were consulted.

Gathering link information between sub-paragraphs of subsequent Standing Orders versions makes it possible to distinguish between types of change (deletion, insertion, modification and no-change), to measure the extent of change more precisely, and later to transfer line codes from one version of the Standing Orders to another, so that all sub-paragraphs (the selected coding entity) that were identical in two versions automatically received the same code.
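The automatic part of the linkage, pairing lines that are exactly identical and leaving the rest for human coders, can be sketched in base R as follows. This is only an illustration under simplifying assumptions: it uses match() on the raw text, which pairs each old line with the first identical new line, and the type label "unresolved" is an invented placeholder. The project's own algorithm may work differently.

```r
old <- c(
  "Three kind mice, see how they run!",
  "They all ran after the farmer's wife,",
  "Who cut off their tails with the carving knife,"
)
new <- c(
  "Three blind mice, see how they run!",
  "They all ran after the farmer's wife,",
  "they took out some cheese,"
)

# Position of each old line in the new version (NA if no identical line exists)
match_in_new <- match(old, new)

links <- data.frame(
  lnr_old = seq_along(old),
  lnr_new = match_in_new,
  # identical lines are settled automatically, the rest is left to human coders
  type    = ifelse(is.na(match_in_new), "unresolved", "no-change"),
  stringsAsFactors = FALSE
)

# New lines that have no identical counterpart in the old version
unlinked_new <- setdiff(seq_along(new), match_in_new)

links
unlinked_new
```

The rows marked "unresolved" and the line numbers in unlinked_new are the cases a human coder would still have to classify as deletion, insertion or modification.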

Furthermore, a semi-automatic approach makes it possible to use the strengths of both computers and humans. Computers are good at doing the same tasks over and over again in the same, predictable way, e.g. finding identical lines, computing measures of similarity, or always saving data in the very same format. Humans, on the other hand, have a much better understanding of the content of a text, might understand the intentions of its authors, and are more creative and flexible, e.g. in finding line pairs that are not very similar in terms of the sequence of characters or the distribution of words but are similar with regard to what they regulate.

library(stringr)
library(diffr)

# defining text
old <- str_split("Three kind mice, see how they run!\nThey all ran after the farmer's wife,\nWho cut off their tails with the carving knife,\nDid you ever see such a thing in your life?\nAs three blind mice.\nEnd", "\n")
new <- str_split("Three blind mice, see how they run!\nThey all ran after the farmer's wife,\nthey took out some cheese,\nand they cut her a slice,\nDid you ever see such a sight in your life\nas three kind mice?", "\n")

# calculating distances, aligning text, determining change types
res <- diffr(old, new, dist="bow")

# distance matrix
res$distance_matrix

# resulting alignment and change type
res$alignment_df

Code Snippet 1: Text comparison with own software written in R
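To give an intuition for what a bag-of-words distance between two lines measures, here is a small base R sketch. It is an assumption-laden illustration and not necessarily how dist="bow" is computed in the snippet above: the helper names tokenize() and bow_distance() are invented here, and the distance is simply the sum of absolute differences in word counts.

```r
# Split a line into lower-case words, dropping punctuation (apostrophes are kept)
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^[:alnum:]']+")[[1]]
  words[nzchar(words)]
}

# Sum of absolute differences in word counts between two lines
bow_distance <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  vocab <- union(ta, tb)
  count_a <- tabulate(match(ta, vocab), nbins = length(vocab))
  count_b <- tabulate(match(tb, vocab), nbins = length(vocab))
  sum(abs(count_a - count_b))
}

old <- c("Three kind mice, see how they run!",
         "They all ran after the farmer's wife,")
new <- c("Three blind mice, see how they run!",
         "They all ran after the farmer's wife,")

# Distance of every old line (rows) to every new line (columns)
dist_mat <- outer(
  seq_along(old), seq_along(new),
  Vectorize(function(i, j) bow_distance(old[i], new[j]))
)
dist_mat
```

A distance of zero indicates lines with the same words in the same frequencies, while small positive values point to candidate modifications that a coder might want to inspect first.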

if(doc_type=="latex"){
  cat(
"\\begin{center}
\\includegraphics[width=\\textwidth]{fig/linkage.png}
\\end{center}"
      )
}else{
  cat("![Figure 5: R Software with commandline interface for linking parts of two versions of Standing Orders](fig/linkage.png)")
}

Figure 5: R Software with commandline interface for linking parts of two versions of Standing Orders

Training and Quality Control

While software was used to help, finding matches between versions was non-trivial. Coders were recruited on the basis that either they were native speakers or had a very high language proficiency. Good knowledge of the government system of the countries was a further requirement. In addition, coders had to undergo training:

if(doc_type=="latex") cat("\\newpage")

Minority-Majority-Change Coding

Description

After having identified which parts of the text were modified, moved around, deleted or inserted, those changes could be coded. For this step, the software drew on the information gathered before in the linkage stage to confront the coder only with changes instead of all the text. Coding decisions - pro majority or pro minority - were recorded and added to the database.

if(doc_type=="latex"){
  cat(
"\\begin{center}
\\includegraphics[width=\\textwidth]{fig/minmaj.png}
\\end{center}"
      )
}else{
  cat("![Figure 6: R Software with commandline interface for coding minority/majority proness of reform](fig/minmaj.png)")
}

Figure 6: R Software with commandline interface for coding minority/majority proness of reform

Training and Quality Assurance

Coding whether changes to the parliamentary Standing Orders favor the majority or the minority is anything but trivial. The original Standing Orders of national parliaments are usually only published in the respective national language. Thus, coders were recruited who were either native speakers or non-native speakers with very high language proficiency. Good knowledge of the government system of the countries was a further requirement. In addition, coders had to undergo intensive training.

if(doc_type=="latex") cat("\\newpage")

Corpus Coding

Description

Based on the linked versions of the Standing Orders, the content could be coded. The intention of the so-called corpus coding process was to assign a single code expressing the content to every legal sub-paragraph of every version of the Standing Orders.

The coding scheme comprises 80 unique single codes belonging to ten different categories (law-making, special decision procedures other than regular law-making, relationship to government, relationship to external offices/institutions apart from the government, generating publicity, internal organization of parliament, change and interpretation of the Standing Orders, general rules regarding formation and legislative session/discontinuity, final provisions, miscellaneous (cannot be coded otherwise)).

Apart from the codes, the coding manual encompasses general rules for coding. As every sub-paragraph got only one code, the coders had to decide which code suits best even if several different codes could be assigned to a sub-paragraph. These decisions were based on a specific hierarchy of codes. Rules which concern the interaction of two actors were attributed to the actor that takes the active part, provided that actor has discretion regarding this action. Regular law-making was considered more important than other decision procedures if they were treated together in one sub-paragraph. A further general coding rule was to take the overall context into account instead of just looking at a specific regulation.

Like the other coding processes, corpus coding was done semi-automatically with a self-written program implemented in R. Human coders went through the oldest version of the Standing Orders and assigned the appropriate codes from the coding scheme to every text line to create one fully coded version. The next step was to transfer these codes to the other versions. In the linking procedure, text lines that had stayed exactly the same from one version to the next had already been identified and linked. The R program automatically assigned the same codes to these lines. Thus, the coders only had to go through the uncoded text lines of the subsequent version of the Standing Orders (that is, the passages that had been changed between two versions) and code them manually. Then, the new codes were transferred. The coders proceeded in this way until all versions of the Standing Orders were completely coded with regard to their content.
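The transfer of codes from one fully coded version to the next can be sketched in base R as follows. This is a simplified illustration: it pairs lines by identical text with match() instead of using the stored link information, the example texts are invented, and the code 61 is a hypothetical placeholder (32 and 999 are taken from the coding scheme in the appendix).

```r
# Fully coded older version (texts invented; code 61 is a hypothetical placeholder)
old_version <- data.frame(
  text = c(
    "#§# Chapter on confidence votes",
    "A motion of no confidence requires the support of a majority of members.",
    "The sitting is opened by the President."
  ),
  code = c(999L, 32L, 61L),
  stringsAsFactors = FALSE
)

# Text lines of the subsequent version, not yet coded
new_text <- c(
  "#§# Chapter on confidence votes",
  "A motion of no confidence requires the support of a majority of members.",
  "The sitting is opened by the President or a Vice-President.",
  "Speaking time is allocated by the President."
)

# Lines that are exactly identical inherit the code of their counterpart
linked <- match(new_text, old_version$text)
new_version <- data.frame(
  text = new_text,
  code = old_version$code[linked],
  stringsAsFactors = FALSE
)

# Only these lines remain for manual coding
needs_manual_coding <- which(is.na(new_version$code))
needs_manual_coding
```

Once the remaining lines are coded by hand, the new version takes the role of old_version for the next round.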

Training and Quality Assurance

The original standing orders of national parliaments are usually only published in the respective national language. Thus, coders were recruited who were either native speakers or non-native speakers with very high language proficiency. Good knowledge about the government system of the countries was a further requirement. As corpus coding was a very demanding task, all coders got intensive training.

if(doc_type=="latex"){
  cat(
"\\begin{center}
\\includegraphics[width=\\textwidth]{fig/corpus.png}
\\end{center}"
      )
}else{
  cat("![Figure 7: R Software with GUI interface for coding corpus codes](fig/corpus.png)")
}

Figure 7: R Software with GUI interface for coding corpus codes

if(doc_type=="latex") cat("\\newpage")

Appendix

Manual for Text Cleansing

First steps

Procedure

Information that should be inserted/deleted

Saving the new documents

Using Ultra Compare

if(doc_type=="latex") cat("\\newpage")

Manual for Linking Standing Orders versions

Before linking the changes in R, the following steps have to be taken

1) Save all cleaned versions as txt-files in "Geschäftsordnungen/Coding Changes/CountryFolder/TXT"
2) Make sure the naming of the TXT-files complies with the following scheme:
   a. Country abbreviation
   b. minus
   c. year (4 digits)
   d. underscore
   e. month (2 digits)
   f. underscore
   g. day (2 digits)
   h. if and only if there are two versions for the same date indicate them with:
   i. underscore
   j. letter (A, B, C, ... for the first, second, ... version on that date, make sure to use CAPITAL Letters)
   k. Examples: "SWE-2002_09_23.txt", "SWE-2003_01_01_A.txt", "SWE-2003_01_01_B.txt", ...
3) Put #§# (hash paragraph hash space) before all irrelevant lines (e.g. headlines, name of standing orders, date of going into force)
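The naming scheme from step 2 can be checked mechanically. The following base R sketch is not part of the project's software; it assumes that the country abbreviation consists of capital letters only, and the object names (name_pattern, is_valid, version_dates) are chosen for illustration.

```r
# Pattern for the scheme: ABBREVIATION-YYYY_MM_DD[_LETTER].txt
# Assumption: the country abbreviation consists of capital letters only
name_pattern <- "^[A-Z]+-[0-9]{4}_[0-9]{2}_[0-9]{2}(_[A-Z])?\\.txt$"

files <- c(
  "SWE-2002_09_23.txt",
  "SWE-2003_01_01_A.txt",
  "SWE-2003_01_01_B.txt",
  "SWE-2003_01_01_b.txt",  # lower-case version letter: not allowed
  "SWE_2002-09-23.txt"     # separators swapped: not allowed
)

is_valid <- grepl(name_pattern, files)

# Extract the date of each valid file name
date_part <- sub("^[A-Z]+-([0-9]{4}_[0-9]{2}_[0-9]{2}).*$", "\\1", files[is_valid])
version_dates <- as.Date(date_part, format = "%Y_%m_%d")

data.frame(file = files, valid = is_valid)
version_dates
```

Running such a check before linking catches misnamed files early, before they are selected in the linking program.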

The intuition behind linking the changes is as follows:

Linking in R

1) Open R
2) Open Notepad
3) Execute one of the following functions
   a. Begin linking: "Coding Changes"
   b. Continue linking: "Recode Changes"
4) Comply with the prompts of the function
   a. Type in your name
   b. Select the original documents (Geschäftsordnungen/Coding Changes/CountryFolder/TXT/txt-file)
   c. Press enter
5) Link non-identical sub-paragraphs (absolutely identical ones are linked automatically)
   - 0: deletion/insertion of a sub-paragraph
   - 1: change of a sub-paragraph
   - 100: sub-paragraph is 100% identical
   - -99: line is irrelevant (this option should have become unnecessary through #§# )
   - -7: show next match
   - 321: correct line number can be inserted directly (useful if the proposed lines do not include the correct line)
   - 666: loop starts again
   - 987: not sure how sub-paragraph should be coded
   - 951: end coding --> every subsequent sub-paragraph will be coded with 987 (if you choose to use this option, write down the last "real" 987 to allow you to continue coding at the right point the next time)
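For reference, the input codes from step 5 can be kept in a small lookup table. The sketch below is purely illustrative and not taken from the linking program; the names link_codes and describe_input are hypothetical.

```r
# Input codes from step 5 and their meaning
link_codes <- c(
  "0"   = "deletion/insertion of a sub-paragraph",
  "1"   = "change of a sub-paragraph",
  "100" = "sub-paragraph is 100% identical",
  "-99" = "line is irrelevant",
  "-7"  = "show next match",
  "321" = "enter the correct line number directly",
  "666" = "start the loop again",
  "987" = "not sure how the sub-paragraph should be coded",
  "951" = "end coding"
)

# Return the meaning of a typed code, or NA if the code is unknown
describe_input <- function(input) {
  key <- trimws(input)
  if (key %in% names(link_codes)) link_codes[[key]] else NA_character_
}

describe_input("1")
describe_input("5")
```

A check like this could be used to reject unknown input before anything is written to the linkage data.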

FAQ

Linked as 100% identical (i.e. 100)

Linked as change (i.e. 1)

ATTENTION: If in doubt, ALWAYS ask us for help! All of our analyses are based on these linked files!

if(doc_type=="latex") cat("\\newpage")

Coding Scheme for Corpus Coding

Basic Intuition:

Each and every code is exclusive, meaning that one sub-paragraph needs to have one code but one code only. For some codes there are notes on how to decide between multiple codes which may seem appropriate. Sometimes even the coding rules and additional notes will not help to decide between codes. In this case please let us know. Every decision accompanied by doubt should be documented.

Further rules of the game:

Scheme

(1) Law Making

 

Note: SPs that refer to both the plenary sessions and committees are coded as 12x; SPs dealing with both law-making and special decision procedures are coded as 1xx.

(2) Special Decision Procedures other than Regular Law-Making

 

Note: SPs which concern multiple special decision procedures apart from regular law-making are coded as follows: highest priority is given to constitutional matters, second highest priority is given to financial laws and budgeting, third highest priority is given to EU policy and fourth highest priority is given to foreign policy.
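The priority rule in this note can be expressed as a short base R sketch. The topic labels are taken from the note itself, while the function name pick_procedure and the idea of passing the topics as a character vector are assumptions made for illustration.

```r
# Priority order from the note, highest priority first
priority <- c(
  "constitutional matters",
  "financial laws and budgeting",
  "EU policy",
  "foreign policy"
)

# Pick the topic with the highest priority among those a sub-paragraph touches
pick_procedure <- function(topics) {
  hits <- priority[priority %in% topics]
  if (length(hits) == 0) NA_character_ else hits[1]
}

pick_procedure(c("foreign policy", "EU policy"))
pick_procedure(c("EU policy", "constitutional matters", "foreign policy"))
```

The order in which the topics are listed for a sub-paragraph does not matter; only the position in the priority vector decides.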

(3) Relationship to Government

 

Note: If a vote of no confidence and a vote of confidence are treated together, the SP is coded as vote of no confidence (32).

(4) Relationship to External Offices/Institutions apart from the Government

(5) Generating Publicity

(6) Internal Organization of Parliament

(7) Change and Interpretation of the Standing Orders

(8) General Rules Regarding Formation and Legislative Session; Discontinuity

(9) Final Provisions

(10) Miscellaneous (cannot be coded otherwise)

(999) Footnotes and Titles Without Relevant Content



petermeissner/idep documentation built on May 25, 2019, 1:53 a.m.