knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

NYTimesArticleStock ---- (Option 1(b))

Author: Steven (Ze) Jia

The fluctuation in the stock market can be reflected and predicted through the most updated news. Triggered by this idea, this data project(package) helps to collect news data from the New York Times Archive API within a defined time period (from 1981 to current). The potential goal of creating this package is to find a method to collect and combine both sturctured (numerical) data from Yahoo Finance and unsturctured (text) data from NY Times together through web scraping data from APIs and text cleaning to get a decent format of merged data by TF-IDF.

Installation

You can install NYTimesArticleStock from github with:

# install.packages("devtools")
devtools::install_github("zj2213/NYTimesArticleStock")

Load the required packages

library(NYTimesArticleStock)

if (!require("pacman")) install.packages("pacman")
pacman::p_load("httr", "dplyr", "tm", "tidytext", "DT")

To start with, you may required to register for a developer account on NY Times Api (see https://developer.nytimes.com/) and obtain an API key.

key = [your API AS KEY]

Scrap text data from NY Times

Examples - it may take longer time to collect more data through longer periods. In this example, it takes about 5 minutes to collect all articles for 2 months from NY Times (from 2000-1 to 2000-2).

urls <- makeURL(begin_year = 2000, begin_month = 1, end_year = 2000, end_month = 2)
DF <- getDF(urls, Sys.getenv("NYTIMES_AS_KEY"))
# saveRDS(DF, "DF.rds")
api_data_sample <- DF

A glimpse of our scrapped dataset

datatable(DF[1:5,])

Some summary statistics about the data

DF %>%
  select(pub_date, headline, type) %>%
  group_by(type) %>%
  count(sort = TRUE) %>%
  head(10)

Grouping text by "NEWS" and date

filtered <- DF %>%
  select(pub_date, headline, type) %>%
  filter(type == "News") %>%
  group_by(pub_date) %>%
  summarise(headline = paste(headline, collapse = " "))
filtered$pub_date <- as.Date(filtered$pub_date)

Text Cleaning

sotu_df <- filtered
sotu_clean <- string_cleaning(sotu_df$headline)

# Convert to corpus
sotu_corpus <- VCorpus(DataframeSource(cbind(sotu_df$pub_date,
                                             sotu_clean)))

# Stem words
sotu_stemmed <- tm_map(sotu_corpus, stemDocument, lazy = TRUE)

# Generate TF-IDF matrix
sotu_tdm <- TermDocumentMatrix(sotu_stemmed
                               )
# Remove sparsity
sotu_m <- as.matrix(sotu_tdm) # Convert TDM to a matrix
mat <- as.data.frame(t(sotu_m[-c(1:60),]))

final_data_sample <- cbind(sotu_df$pub_date, mat)
names(final_data_sample)[1]<-"pub_date"

Check Datasets

head(api_data_sample)
head(final_data_sample)

References

https://developer.nytimes.com/archive_api.json



zj2213/NYTimesArticleStock documentation built on Jan. 6, 2021, 11:51 p.m.