
Estimating public opinion in developing countries using Twitter data

This project outlines a data-driven strategy for estimating public opinion in English-speaking developing countries, and assesses its feasibility and limitations.

Depending on how reliably the strategy can estimate public opinion, this may have consequences for political and electoral accountability and for the social contract. The estimates may serve as a point of comparison against official electoral results.

This is my thesis at UCL.

To reproduce my analysis and pipeline, you can download my R package. It was created to help promote open-source, scalable research.

library(devtools)
devtools::install_github("wrmadsen/dev-public-opinion", build_vignettes = TRUE)
library(devpublicopinion)

Data collection: tweets, covariates, electoral results, survey results, and language demographics
Tweet analysis:
- Sentiment analysis
- Higher- and lower-level covariates
Public opinion prediction:
- Machine learning with training and validation data
- Multilevel regression with post-stratification
Validation:
- Election data
- Events data
- Differences by regions, characteristics, events, etc.

Python's twint module allows us to collect tweets in a scalable way. With R's reticulate package, I run the Python script from R.
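
As a rough illustration of the collection step, the sketch below builds a single twint search for one keyword and date window. The helper names `build_search_settings` and `run_search`, the output file name, and the chosen settings are assumptions made for this example, not the package's actual script. `run_search` needs twint to be installed.

```python
def build_search_settings(keyword, since, until, limit=None, output="tweets.csv"):
    """Return twint Config attribute names mapped to values for one search."""
    settings = {
        "Search": keyword,     # search term, e.g. a candidate's name
        "Since": since,        # start date as "YYYY-MM-DD"
        "Until": until,        # end date as "YYYY-MM-DD"
        "Store_csv": True,     # write the results to a CSV file
        "Output": output,      # hypothetical output path
        "Hide_output": True,   # do not print every tweet to the console
    }
    if limit is not None:
        settings["Limit"] = limit  # cap on the number of tweets
    return settings


def run_search(settings):
    """Copy the settings onto a twint.Config object and run the search."""
    import twint  # imported lazily so build_search_settings works without twint

    config = twint.Config()
    for name, value in settings.items():
        setattr(config, name, value)
    twint.run.Search(config)
```

A call such as `run_search(build_search_settings("Buhari", "2015-01-01", "2015-01-15"))` would then collect one window.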

The Python module multiprocessing allows us to cut down on the time spent getting tweets. It runs collections in parallel across multiple worker processes on your computer. The Python script is called from R, but it does not run inside the R session through reticulate, since that package does not support multiprocessing.

For now, I use the Pool class of multiprocessing rather than Process. I have not tested which is quicker.
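
A minimal sketch of the Pool approach, assuming the collection is split into one-day windows and each window is handled by its own worker call. `daily_windows`, `collect_window` and `collect_parallel` are hypothetical names, and `collect_window` is only a placeholder that returns a label instead of fetching tweets.

```python
from datetime import date, timedelta
from multiprocessing import Pool


def daily_windows(start, end):
    """Split [start, end) into consecutive one-day (since, until) pairs."""
    days = (end - start).days
    return [
        (start + timedelta(days=d), start + timedelta(days=d + 1))
        for d in range(days)
    ]


def collect_window(window):
    """Placeholder worker: a real version would fetch tweets for the window."""
    since, until = window
    return f"{since.isoformat()}_{until.isoformat()}"


def collect_parallel(windows, workers=7):
    """Run collect_window over all windows in a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(collect_window, windows)
```

When run as a script, the call `collect_parallel(daily_windows(date(2015, 1, 1), date(2015, 1, 15)))` should sit under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start workers by spawning a fresh interpreter.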

The task may be I/O-bound, so it may make sense to use more workers than there are cores available. This article suggests that multithreading may be the better fit since the task is I/O-heavy. That might mean I need to look into using the threading module.
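
To show why threads can help with I/O-heavy work, here is a small sketch using `concurrent.futures.ThreadPoolExecutor` from the standard library, which is built on the threading module. The network request is replaced by `time.sleep`, and the function names and the worker count are assumptions for illustration. A thread that is sleeping or waiting on the network does not hold Python's global interpreter lock, so many more threads than cores can wait at the same time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(window_id, wait=0.05):
    """Stand-in for a network call: the thread sleeps instead of waiting on I/O."""
    time.sleep(wait)
    return window_id


def collect_threaded(window_ids, workers=30):
    """Run fake_request for every window using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map returns results in the order of the inputs
        return list(executor.map(fake_request, window_ids))
```

With 28 windows and 28 threads, the total time should be close to a single wait rather than 28 waits in a row.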

Differences between collecting tweets in one-day windows and in multi-day windows. Collecting "Buhari" during the first two weeks of January 2015:
1. per 1 day, threads = 7: 351.08 seconds (188 MB), which is 0.54 MB per second
2. per 2 days, threads = 7: 457.49 seconds (222 MB), which is 0.49 MB per second
3. per 7 days, threads = 7: 1301.32 seconds (266 MB, since the 7-day periods stretched beyond the two weeks)
4. per 1 day, threads = 14: 296.17 seconds (201 MB)
5. per 1 day, threads = 30: 300.39 seconds (201 MB)
6. per 12 hours, threads = 14: 287.37 seconds (118 MB)
7. per 12 hours, threads = 30: 286.12 seconds (118 MB)
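
The MB-per-second figures can be recomputed for all seven scenarios with a few lines of Python. The numbers are copied from the benchmark above; the tuple layout and the function name `throughput` are my own choices for this sketch.

```python
# (window, threads, seconds, megabytes) as reported in the benchmark
scenarios = [
    ("1 day", 7, 351.08, 188),
    ("2 days", 7, 457.49, 222),
    ("7 days", 7, 1301.32, 266),
    ("1 day", 14, 296.17, 201),
    ("1 day", 30, 300.39, 201),
    ("12 hours", 14, 287.37, 118),
    ("12 hours", 30, 286.12, 118),
]


def throughput(seconds, megabytes):
    """Return the collection rate in MB per second."""
    return megabytes / seconds


for window, threads, seconds, megabytes in scenarios:
    rate = throughput(seconds, megabytes)
    print(f"per {window}, threads = {threads}: {rate:.2f} MB per second")
```

Note that the data volumes differ between scenarios, so the rates are only a rough guide to which setup is quicker.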

Imposing limit = 20000 on the 6th scenario cuts the time to 199.5 seconds while only reducing the collected data to 116 MB. limit = 10000 cuts it to 104.68 seconds and 102 MB, and limit = 5000 cuts it to 54.8 seconds and 61 MB. Here, limit refers to the number of tweets per 12-hour interval. These figures are for non-geocoded tweets.

A takeaway may be to impose a limit on non-geocoded tweets to save time, and to collect geocoded tweets, which are lower in volume, without a limit.
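
The trade-off between time saved and data lost can be made explicit by comparing each limited run with the unlimited 6th scenario (287.37 seconds, 118 MB). This is a small sketch of that arithmetic; the variable and function names are assumptions.

```python
baseline_seconds, baseline_megabytes = 287.37, 118  # scenario 6, no limit

# limit -> (seconds, megabytes), as reported for the limited runs
limited_runs = {
    20000: (199.5, 116),
    10000: (104.68, 102),
    5000: (54.8, 61),
}


def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return 100 * (before - after) / before


for limit, (seconds, megabytes) in limited_runs.items():
    time_saved = percent_reduction(baseline_seconds, seconds)
    data_lost = percent_reduction(baseline_megabytes, megabytes)
    print(f"limit = {limit}: {time_saved:.1f}% less time, {data_lost:.1f}% less data")
```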

Which countries?

Group: Nigeria, Zimbabwe, Afghanistan, Mozambique, and Georgia. Differences in English-speaking proportion, number of Twitter users, electoral corruption and other characteristics can affect the accuracy of Twitter-based public opinion predictions.

Rotating proxies: Robin Hood method:

It may be necessary to automatically change IP proxies during the collection.
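
One simple way to rotate is to cycle through a fixed list of proxies and hand the next one to each collection window. The sketch below assumes that scheme. The proxy addresses are placeholders from the 192.0.2.0/24 documentation range, and `assign_proxies` is a hypothetical helper, not part of the package.

```python
from itertools import cycle

# Hypothetical proxy list; 192.0.2.0/24 is reserved for documentation.
PROXIES = [("192.0.2.10", 8080), ("192.0.2.11", 8080), ("192.0.2.12", 8080)]


def assign_proxies(windows, proxies):
    """Pair each collection window with the next proxy, wrapping around."""
    rotation = cycle(proxies)
    return [(window, next(rotation)) for window in windows]
```

Each worker could then read its host and port from the pair it receives, so that consecutive windows go out through different IP addresses.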

The following resources were either used or considered during the text analysis:
- Tidy Text: https://www.tidytextmining.com/index.html
- tokenizers package: https://cran.r-project.org/web/packages/tokenizers/index.html

Age, race, gender: https://github.com/wri/demographic-identifier
Gender: Use census data for each country. Liu and Ruths' (2013) gender-name association score, which ranges from -1 to 1, could work. http://www.namepedia.org/en/firstname/
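
One way to get a score on that scale from census counts is the normalised difference between the number of men and women bearing a name. The formula and its sign convention below are an assumption made for illustration and should be checked against Liu and Ruths (2013). The function name is hypothetical.

```python
def gender_name_score(male_count, female_count):
    """Score in [-1, 1]: -1 if only women bear the name, 1 if only men do."""
    total = male_count + female_count
    if total == 0:
        return None  # name does not appear in the census data
    return (male_count - female_count) / total
```

A name with a score near 0 is ambiguous, so a threshold on the absolute score could decide which users receive a gender label at all.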

The validation approach depends on what is available for each country. For example, if a country's official election results are not reliable, there are other sources to look at, such as polling or election complaints.

Election complaints: Validation could also be done for certain countries if they publish data on election complaints. One hypothesis is that the gap between the official vote share and the Twitter-based prediction correlates positively with the number of complaints. Afghanistan publishes data on complaints.
Compare against a country with rich polling data, e.g. the US or the UK.
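
The complaints hypothesis could be checked with a plain Pearson correlation between the absolute prediction gap and the complaint count per province. The sketch below is only an illustration under that assumption. All values are synthetic placeholders, not real election data, and the function names are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def prediction_gaps(official, predicted):
    """Absolute difference between official and predicted vote shares."""
    return [abs(o - p) for o, p in zip(official, predicted)]


# Synthetic placeholder values for five provinces, not real data.
official_share = [0.55, 0.60, 0.48, 0.70, 0.52]
twitter_share = [0.53, 0.50, 0.47, 0.45, 0.50]
complaints = [12, 80, 5, 210, 20]

r = pearson(prediction_gaps(official_share, twitter_share), complaints)
print(f"correlation between gap and complaints: {r:.2f}")
```

A positive coefficient would be in line with the hypothesis, although with only a handful of provinces it would be weak evidence.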

Afghanistan:

On the 2019 election, Gannon writes: "Saturday’s vote was marred by violence, Taliban threats and widespread allegations of mismanagement and abuse". Investigate whether predictions can be validated by comparing them to province-level death tolls and Taliban control.

Compare with electoral complaints at the province level.