knitr::opts_chunk$set( echo = TRUE , eval = TRUE , warning = FALSE , message = FALSE , fig.align = 'center' , fig.height = 4 , fig.width = 8 )
In this short tutorial, I show how to use functions provided by the manifestoEnhanceR
package can be used to convert a large batch of manifesto documents into XML documents enriched with document structure (headers, sentences, quasi-sentence nesting) and meta data (e.g, manifesto title or quasi-sentence CMP code).
I use the entire Swiss manifesto corpus and you'll be guided step by step through the process of
manifestoR
package's custom data objects into a tidy data format (i.e., a data frame), Before we beginn with this, a short note on why one would want to convert manifestoR
data object to XML files.
By enhancing manifesto data frames, we reconstruct a manifestos document structure. If you look at a manifesto in its print or online version, it starts with a title, then comes an introduction or summary of key points, following by a number of chapters that comprise many more sentences.
This document structure is not reflected in the way how the Manifesto Project makes its data available. While this is not a problem if you are merely interested in their codings, it can be a problem in case you want to use the manifesto test for other puporses, such as topic modeling or for training a word embedding model. For these latter puproses, quasi-sentences are not an ideal unit of (dis)aggregation. Sentences, paragraphs or chapters seem more appropriate.
Now, it turns out while paragraph structure cannot be recovered, we can infer sentences and chapter structure by combining the text of quasi-sentences and their codings.
This is the purpose of the manifestoEnhanceR
package.
# if not installed: `install.packages("manifestoR")` library(manifestoR) # if not installed: `devtools::install_github("haukelicht/manifestoEnhanceR)` library(manifestoEnhanceR) library(dplyr) library(tidyr) library(purrr) library(xml2) library(rvest)
# set Manifesto Project API key (see `?mp_setapikey`) mp_setapikey(file.path(Sys.getenv("SECRETS_PATH"), "manifesto_apikey.txt"))
# get CMP party-year data cmp <- mp_maindataset(south_america = FALSE) # subset to Swiss configurations ch_configs <- cmp %>% filter(countryname == "Switzerland") %>% select(1:10) # get Swiss manifestos ch <- mp_corpus(countryname == "Switzerland", cache = TRUE)
The object ch
is a 'ManifestoCorpus' object, which is essentially a list 'ManifestoDocument' objects with some additional fancy attributes.
class(ch) str(head(ch, 2), 1) lapply(head(ch, 2), class)
We can tidy this up (i.e., convert it to a long tibble) by applying the first manifestoEnhanceR function we encounter in this tutorial: as_tibble.ManifestoCorpus
man_dfs <- as_tibble(ch) head(man_dfs)
man_dfs
is a tibble
(a fancy dataframe) with a list-column called 'data'.
For each row, column data contains a manifesto tibble.
To this manifesto-level tibble, we can left-join manifesto- and party-level data:
man_sents <- man_dfs %>% left_join( ch_configs %>% transmute(manifesto_id = paste0(party, "_", date), party, partyname, partyabbrev) %>% unique() )
To illustrate the other manifestoEnhanceR functions, we take four manifestos from the above corpus, each with some particular features (as commented below).
test1_df <- unnest(filter(man_sents, manifesto_id == "43120_201110"), data) test2_df <- unnest(filter(man_sents, manifesto_id == "43110_198710"), data) test3_df <- unnest(filter(man_sents, manifesto_id == "43110_199910"), data) test4_df <- unnest(filter(man_sents, manifesto_id == "43120_201510"), data)
test2_df
has just one row, and the single cell of column 'text' contains the entire, uncoded text of the manifesto. This is typical for very old manifestos or manifestos of very small parties, where the Manifesto Project team assigned no expert to split the manifesto in quasi-sentences and code it. We say that this manifesto has not been 'annotated'.test3_df
, test2_df
and test_df
have only numerical codes, that is, no 'H' code indicating document headers and titles.test3_df
and test4_df
have both titles, but in the former case which rows hold the manifesto title needs to be inferred from the ordering of text lines.This sounds confusing? It is. But no worries. Function enhance_manifesto_df()
handles all these intricacies for you as shown below.
enhance_manifesto_df()
returns a manifesto.df
object that inherits from the input tibble
, and simply adds four columns indicating document structure:
In addition, the returned manifesto.df
obejct has two additional attributes:
By enhancing manifesto data frames, we add bloc, sentence and quasi-sentence (qs) counters, as well as a role indicators ('qs', 'header', 'title', or 'meta').
As one natrual sentence may contain multiple quasi-sentences, the latter map m:1 to the former. A bloc, in turn, is a number of consecutive rolws that all have the same 'role'. The following rows are defined:
This information is important if you want to count the number of chapters in a manifesto (= No. headers), or print only the title (if exists). And as noted above, having at hand this information, one could easily aggregate quasi-sentences at the sentence or chapter level while omitting title and other meta text.
test1_res <- enhance_manifesto_df(test1_df) select(test1_res, 1:4, 17:25) class(test1_res) attr(test1_res, "annotated") attr(test1_res, "extra_cols") "title" %in% test1_res$role
test2_res <- enhance_manifesto_df(test2_df) select(test2_res, 1:4, 17:25) class(test2_res) attr(test2_res, "annotated")
test3_res <- enhance_manifesto_df(test3_df) class(test3_res) attr(test3_res, "annotated") "title" %in% test3_res$role
test4_res <- enhance_manifesto_df(test4_df) class(test4_res) attr(test4_res, "annotated") "title" %in% test4_res$role
Having enhanced our test case data frames, we can now use manifesto_df_to_xml()
to convert them to XML.
You have basically two options
parse = FALSE
returns the XML-formatted stringparse = TRUE
(the default) returns an {xml2} xml_document
object.# test case 1: as string test1_xml <- manifesto_df_to_xml(test1_res, parse = FALSE) class(test1_xml) length(test1_xml) test1_xml <- manifesto_df_to_xml(filter(test1_res, bloc_nr < 3), parse = FALSE) # cat(test1_xml) # as XML document test1_xml <- manifesto_df_to_xml(test1_res) test1_xml <- manifesto_df_to_xml(man.df = filter(test1_res, bloc_nr %in% 9:9)) class(test1_xml) test1_xml
# test case 2 test2_xml <- manifesto_df_to_xml(test2_res, parse = FALSE) cat(test2_xml) test2_xml <- manifesto_df_to_xml(test2_res) class(test2_xml) test2_xml
# test case 3 test3_xml <- manifesto_df_to_xml(man.df = test3_res, parse = FALSE) class(test3_xml); length(test3_xml) test3_xml <- manifesto_df_to_xml(test3_res) class(test3_xml) test3_xml html_node(test3_xml, "title")
# test case 4 test4_xml <- manifesto_df_to_xml(man.df = test4_res, parse = FALSE) class(test4_xml); length(test4_xml) test4_xml <- manifesto_df_to_xml(test4_res) class(test4_xml) test4_xml html_node(test4_xml, "title")
As shown in the last code block, you can access the different nodes and attributes of the returned XML documents using {rvest}'s html_*
.
The below code simply combines the above steps in a single pipeline.
man_xmls <- man_sents %>% # drop unneccessary columns select(-has_eu_code, -may_contradict_core_dataset, -md5sum_text, -md5sum_original, -annotations, -id) %>% # unnest data unnest(data) %>% # split by manifesto ID (-> list od DFs) split(.$manifesto_id) %>% # apply enhance map(enhance_manifesto_df) %>% # apply converter map(safely(manifesto_df_to_xml)) %>% # gather all in list-column data frame enframe() %>% transmute( manifesto_id = name , xml = map(value, "result") , errors = map(map(value, "error"), "message") , n_errors = lengths(errors) )
Calling this code returns a tibble with list-column 'xml' containing the individual manifesto XML documents.
# the returned tibble contains xml_documents in the 'xml' list-column head(man_xmls, 3)
To make sure that all covnersions were successful, you can look at column 'n_errors':
# there were no errors raised man_xmls %>% filter(n_errors > 0) %>% unnest(errors)
You can alos loop over XML documents and write them to disk.
# write to disk map2( man_xmls$xml , man_xmls$manifesto_id , function(xml, id, path = file.path("your", "local", "path")) { write_xml(xml, file.path(path, paste0(id, ".xml"))) })
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.