Fluctuations in the stock market are often reflected in, and may be partly predictable from, the latest news. Motivated by this idea, this data project (package) helps collect news data from the New York Times Archive API for a user-defined time period (from 1981 to the present). The broader goal of the package is to provide a way to combine structured (numerical) data from Yahoo Finance with unstructured (text) data from the NY Times: the data are retrieved from APIs, the text is cleaned, and the result is merged into a usable format using TF-IDF.
You can install NYTimesArticleStock from GitHub with:
# install.packages("devtools")
devtools::install_github("zj2213/NYTimesArticleStock")
library(NYTimesArticleStock)
if (!require("pacman")) install.packages("pacman")
pacman::p_load("httr", "dplyr", "tm", "tidytext", "DT")
To get started, you need to register for a developer account with the NY Times API (see https://developer.nytimes.com/) and obtain an API key.
key <- "YOUR_API_KEY"  # replace with the key from your NY Times developer account
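Because the example in this README reads the key with `Sys.getenv("NYTIMES_AS_KEY")`, one convenient option is to keep the key in that environment variable instead of hard-coding it in scripts. The following is a minimal sketch; the value `"your-api-key"` is a placeholder, and storing the key in `~/.Renviron` is a suggestion rather than something the package requires.

```r
# Set the key for the current session; "your-api-key" is a placeholder.
# Alternatively, add a line NYTIMES_AS_KEY=... to ~/.Renviron and restart R.
Sys.setenv(NYTIMES_AS_KEY = "your-api-key")

# Read the key back and fail early if it is missing.
key <- Sys.getenv("NYTIMES_AS_KEY")
if (!nzchar(key)) {
  stop("NYTIMES_AS_KEY is not set; register at https://developer.nytimes.com/")
}
```

Keeping the key out of the source code makes it less likely to end up in version control by accident.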
Example: collecting data over longer periods takes more time. In this example, it takes about 5 minutes to collect all NY Times articles for 2 months (2000-01 to 2000-02).
urls <- makeURL(begin_year = 2000, begin_month = 1, end_year = 2000, end_month = 2)
DF <- getDF(urls, Sys.getenv("NYTIMES_AS_KEY"))
# saveRDS(DF, "DF.rds")
api_data_sample <- DF
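To give a feel for what a per-month URL list for the Archive API might look like, here is a minimal sketch. The helper `build_archive_urls()` is hypothetical and is not the package's `makeURL()` implementation; the endpoint pattern `https://api.nytimes.com/svc/archive/v1/{year}/{month}.json` is an assumption based on the public Archive API documentation, and the API key would still need to be supplied with each request.

```r
# Hypothetical helper (not part of NYTimesArticleStock): one URL per month
# between the begin and end dates, inclusive.
build_archive_urls <- function(begin_year, begin_month, end_year, end_month) {
  # Map each (year, month) pair to a running month index.
  start <- begin_year * 12 + (begin_month - 1)
  end <- end_year * 12 + (end_month - 1)
  stopifnot(start <= end)

  idx <- seq(start, end)
  years <- idx %/% 12
  months <- idx %% 12 + 1

  # Assumed Archive API endpoint pattern.
  sprintf("https://api.nytimes.com/svc/archive/v1/%d/%d.json", years, months)
}

example_urls <- build_archive_urls(1999, 12, 2000, 2)
print(example_urls)
```

Working with a running month index keeps ranges that cross a year boundary simple, since no special case is needed for December to January.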
datatable(DF[1:5,])
DF %>%
  select(pub_date, headline, type) %>%
  group_by(type) %>%
  count(sort = TRUE) %>%
  head(10)
filtered <- DF %>%
  select(pub_date, headline, type) %>%
  filter(type == "News") %>%
  group_by(pub_date) %>%
  summarise(headline = paste(headline, collapse = " "))
filtered$pub_date <- as.Date(filtered$pub_date)
sotu_df <- filtered
sotu_clean <- string_cleaning(sotu_df$headline)
# Convert to corpus
sotu_corpus <- VCorpus(DataframeSource(cbind(sotu_df$pub_date, sotu_clean)))
# Stem words
sotu_stemmed <- tm_map(sotu_corpus, stemDocument, lazy = TRUE)
# Build the term-document matrix. tm weights by raw term frequency by default;
# pass control = list(weighting = weightTfIdf) to get TF-IDF weights.
sotu_tdm <- TermDocumentMatrix(sotu_stemmed)
# Convert the sparse TDM to a dense matrix
sotu_m <- as.matrix(sotu_tdm)
# Drop the first 60 terms and transpose so that rows are dates and columns are terms
mat <- as.data.frame(t(sotu_m[-c(1:60), ]))
final_data_sample <- cbind(sotu_df$pub_date, mat)
names(final_data_sample)[1] <- "pub_date"
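For readers unfamiliar with TF-IDF, the following base R sketch computes it by hand on three made-up headlines. It is illustrative only: the sample texts and variable names are assumptions, it uses the natural logarithm, and the weights produced by tm may differ in log base and normalization.

```r
# Three made-up daily headline strings (illustrative data only).
docs <- c(
  d1 = "stocks rally as markets rise",
  d2 = "markets fall as stocks slide",
  d3 = "senate passes budget bill"
)

# Tokenize on whitespace and build the vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Term frequency: share of each document's tokens taken up by each term.
tf <- t(vapply(tokens, function(tok) {
  counts <- tabulate(match(tok, vocab), nbins = length(vocab))
  counts / length(tok)
}, numeric(length(vocab))))
rownames(tf) <- names(docs)
colnames(tf) <- vocab

# Inverse document frequency: log(number of documents / documents containing the term).
doc_freq <- colSums(tf > 0)
idf <- log(nrow(tf) / doc_freq)

# TF-IDF: scale each term's column by its IDF.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

Terms that occur in only one document (such as "senate") receive a higher weight than terms shared across documents (such as "stocks"), which is what makes the representation useful for distinguishing one day's headlines from another's.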
head(api_data_sample)
head(final_data_sample)
https://developer.nytimes.com/archive_api.json