This introduction shows how you could gather meta data (such as title and URL) from the News API and use this information to download the complete article and calculate the sentiment for the text body. First, we download the meta data using the newsanchor
package. Secondly, we write a function that allows us to automatically scrape the text of any "New York Times" (NYT) article from their website. We apply this function on the URLs fetched in the first section. Thirdly, we calculate the sentiment for each article using the AFINN dictionary. While a dictionary approach is probably not ideal to analyze political newspaper articles, this introduction provides an insight how the newsanchor
package can be used for more detailed analyses.
Before we start downloading the actual content we have to load the required packages. Below, you find a short summary of the purpose of each package.
newsanchor
enables us to download necessary meta data. Unfortunately, the free News API account only allows to query data within the last 3 months. For the purpose of this tutorial, we use an internal data set. The data set is also available using the newsanchor::sample_response
. You find detailed information about the sample response object using ?newsanchor::sample_response
.
robotstxt
provides functionalities that automatically read the robots.txt
file of a website. The robots.txt
file allows website administrators to define which scrapers and robots are allowed to visit certain folders within the webiste. While the usage is mainly based on trust, you should definitly always check the file.
httr
is a wrapper for the curl
package and provides functions to query modern web APIs. It allows to easily download websites and access useful information about the connection.
rvest
ships with functions that allow to easily parse HTML to characters. We can search for certain items on the downloaded website and easily access their text and attributes.
dplyr
is a package that provides neat functions to manipulate data frames. It will be essential in our last task: the sentiment calculation. Furthermore, it autmatically loads the magrittr
package with its beautiful pipe operators.
stringr
is a wrapper for string manipulation functions. It provides a consistent grammar. Hence, we prefer it over stringi
and the grep
family.
tidytext
is a tool that provides text manipulation along with the tidy data principles. It works well along with dplyr
and is used to apply the sentiment calculations.
textdata
is used to get the sentiment analyses done, accessing a certain lexicon (AFINN). Users must agree to understand the library’s license/terms of use before the dataset is downloaded.
# load all required packages library(newsanchor) # download newspaper articles library(robotstxt) # get robots.txt library(httr) # http requests library(rvest) # web scraping tools library(dplyr) # easy data frame manipulation library(stringr) # string/character manipulation library(tidytext) # tidy text analysis library(textdata) # contains the AFINN lexicon
First, we have to download the meta data using the get_everything
function of the newsanchor
package. We query the News API for all articles about Donald Trump between 3rd and 9th December 2019 in the NYT. Instead of searching within the NYT, we could also narrow our search by looking for news in a certain language using the language
argument. All available arguments can be seen using ?get_everything
. We assign the result to response
and extract the data frame that includes newspaper articles and corresponding meta data, such as URL, author, title, etc. Unfortunately, the data frame does not include the whole article text. In the following code, we use the advanced function get_everything_all
of newsanchor
. The only difference to get_everything
is that it downloads all available results at once.
# get headlines published by the NYT response <- get_everything_all(query = "Trump", sources = "the-new-york-times", from = "2018-12-03", to = "2018-12-09") # extract response data frame articles <- response$results_df
Since News API does not allow to query results that are older than 3 months, we decided to append a sample data set to the newsanchor
package. Below we show how you load the example data set that equals the above query.
articles <- sample_response$results_df
Before we start downloading a lot of articles from the NYT, we should check if we are allowed to access their website automatically. As explained previously, usually each website provides a robots.txt
file that includes permissions for bots. You can see the file by opening https://www.nytimes.com/robots.txt in your favorite browser. By the way, appending robots.txt
to the root URL should work for every website. The robotstxt
package provides the function paths_allowed()
that returns TRUE
when you are allowed to scrape the site and, vice versa, FALSE
. We test our URL vector. Afterwards, we use all()
. all()
returns TRUE
when all items of a vector are TRUE
. Since all()
yields TRUE
, we are allowed to scrape the given URLs.
allowed <- paths_allowed(articles$url) all(allowed)
We define a function that allows us to download the article body for any given NYT URL. Hence, the function takes only one argument: the URL. First, we download the complete website using the GET()
function from the httr
package. We can check whether the server returned a valid answer. Generally, we accept all responses with a 200 status code. Usually, 4xx codes describe a user error and 5xx errors describe a server error. An useful overview on the mostly used status codes can be found on Wikipedia. If the server returns an error, we return NA
. If the server response is valid, we extract the content of the response. The content()
function returns only the raw HTML code of the website. We can parse the HTML code using the read_html
function so that R "understands" HTML. Subsequently, we define a selector to search for the article text. The selector defines which elements on a website we want to target. You find an introduction to selectors here. Furthermore, there is the very useful selector gadget tool to find selectors on every website. Finally, we can search for the selector using html_nodes()
and extract the content using html_text()
. Additionally, we remove all line breaks using str_replace_all()
and paste/glue the character vector into one big text.
This function is, of course, only a simple sample. We could amend the function with further tests of the returned content and detailed error handling and messages. Additionally, we could vectorize the function and allow users to enter a vector of URLs. This could be done using functions of the apply
family. However, for the purpose of this tutorial, we want to keep the function as simple as possible.
get_article_body <- function (url) { # download article page response <- GET(url) # check if request was successful if (response$status_code != 200) return(NA) # extract html html <- content(x = response, type = "text", encoding = "UTF-8") # parse html parsed_html <- read_html(html) # define paragraph DOM selector selector <- "article#story div.StoryBodyCompanionColumn div p" # parse content parsed_html %>% html_nodes(selector) %>% # extract all paragraphs within class 'article-section' html_text() %>% # extract content of the <p> tags str_replace_all("\n", "") %>% # replace all line breaks paste(collapse = " ") # join all paragraphs into one string }
After we defined a function that is able to scrape NYT articles, we want to apply the function to our list of URLs. We append an empty new column to the data set. Afterwards, we initialize a progress bar which will show us the progress within the loop. Within each loop we apply the function to the i-th URL and save the result to the newly created body column. Additionally, we pause the program for 1 second. You may ask why we talk about apply all the time, but we do not use the apply family? The answer is multifaceted. First, we want to execute the function within a loop because we want to keep track of the progress. Second, we can easily debug our function if we know which URL breaks the function. However, if you write a more advanced function, you could easily replace the loop with an apply function, such as sapply()
.
# create new text column articles$body <- NA # initialize progress bar pb <- txtProgressBar(min = 1, max = nrow(articles), initial = 1, style = 3) # loop through articles and "apply" function for (i in 1:nrow(articles)) { # "apply" function to i url articles$body[i] <- get_article_body(articles$url[i]) # update progress bar setTxtProgressBar(pb, i) # sleep for 1 sec Sys.sleep(1) }
After we finally downloaded all articles, we can calculate the sentiment for each article. We use a simple dictionary approach. A dictionary approach assigns a positive or negative score or label to each word. We use the AFINN dictionary that assigns numerical scores between -5
and 5
to English words. As stated in the introduction, dictionary approaches, especially with non-domain specific dictionaries, might not be the best choice to determine the sentiment of newspaper articles. However, due to the simplicity of this tutorial, we stick to the dictionary approach. In the end, we group the scores by the date they were published and calculate the mean score for each respective day.
sentiment_by_day <- articles %>% select(url, body) %>% # extract required columns unnest_tokens(word, body) %>% # split each article into single words anti_join(get_stopwords(), by = "word") %>% # remove stopwords inner_join(get_sentiments("afinn"), by = "word") %>% # join sentiment scores group_by(url) %>% # group text again by their URL summarise(sentiment = sum(value)) %>% # sum up sentiment scores left_join(articles, by = "url") %>% # add sentiment column to articles select(published_at, sentiment) %>% # extract required columns group_by(date = as.Date(published_at)) %>% # group by date summarise(sentiment = mean(sentiment), n = n()) # calculate summaries
Below, you find the plot that results from our analysis. We can see that most of the articles were published Tuesday and Friday. The least number of articles can be found Saturday and Sunday. Probably there is less staff available during the weekend or Donald Trump is busy playing golf in Mar-a-Lago. Hence, less newsworthy events take place.
Monday, Tuesday and Friday have negative sentiment scores. Digging into the Tuesday's headings, we can see that news do not have a common theme. Friday's results look similar. However, we can find various articles about the Mueller investigation. Saturday's score is outstandingly positive. Hence, one would expect articles about a certain newsworthy positive event. Unfortunately, the result shows us that we cannot find a common theme again.
The analysis above shows that one needs to be very careful with sentiment analysis. It seems that the dictionary approach did not capture the overall atmosphere of the respective day since the articles seem to be very different. One could probably review whether we find anomalies along the authors, the length of the articles or other attributes to explain the strongly varying scores.
# enable two plots in one figure old_par <- par(mfrow=c(1, 2)) # plot number of articles vs. time barplot(height = sentiment_by_day$n, names.arg = format(sentiment_by_day$date, "%a"), ylab = "# of articles", ylim = c(-10, 35), las = 2) # plot sentiment score vs. time barplot(height = sentiment_by_day$sentiment, names.arg = format(sentiment_by_day$date, "%a"), ylab = "Sentiment Score", ylim = c(-10, 35), las = 2)
As stated before, we could have made further improvements to the above code. While the get_article_body()
function provides the article text, it does not differentiate between the actual paragraphs and the headings. We could amend the function so it provides a vector where each item represents either the heading or a paragraph. Due to the simplicity of our anaylsis we did not need such details. Furthermore, we probably could have checked whether the article consists of multiple pages. Currently, our function only returns the main page of the article. However, if the article consists of multiple pages, we miss those. Additionally, we could write functions that enable to extract the comments and also the images of each article. This could be useful for further analysis.
These suggestions might be implemented in future versions of the newsanchor
package to provide easy functions for automated web scraping of online newspaper articles.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.