In zj2213/NYTimesArticleStock: NY Times Article Archive

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

NYTimesArticleStock ---- (Option 1(b))

Author: Steven (Ze) Jia

Recent news can both reflect and help predict fluctuations in the stock market. Motivated by this idea, this data project (package) collects news data from the New York Times Archive API over a user-defined time period (from 1981 to the present). The longer-term goal is to combine structured (numerical) data from Yahoo Finance with unstructured (text) data from the NY Times: the package retrieves articles through the API, cleans the text, and converts it to a TF-IDF representation that can be merged with the numerical data.

Installation

You can install NYTimesArticleStock from GitHub with:

# install.packages("devtools")
devtools::install_github("zj2213/NYTimesArticleStock")

Load the required packages

library(NYTimesArticleStock)

if (!require("pacman")) install.packages("pacman")
pacman::p_load("httr", "dplyr", "tm", "tidytext", "DT")

To start, you need to register for a developer account on the NY Times API site (see https://developer.nytimes.com/) and obtain an API key.

key <- "[your API AS KEY]"  # placeholder: replace with your own key
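
The scraping example in this README reads the key from the environment variable `NYTIMES_AS_KEY`. As a minimal sketch of that setup (the key value below is a placeholder, not a real key), the variable can be set and checked like this:

```r
# Store the key in an environment variable for the current R session.
# "your-api-key-here" is a placeholder, not a real key.
Sys.setenv(NYTIMES_AS_KEY = "your-api-key-here")

# Read it back where it is needed; Sys.getenv() returns "" if the variable is unset.
key <- Sys.getenv("NYTIMES_AS_KEY")
if (!nzchar(key)) {
  stop("NYTIMES_AS_KEY is not set; register at https://developer.nytimes.com/ to get a key.")
}
```

To keep the key across sessions without writing it into scripts, a line of the form `NYTIMES_AS_KEY=...` can be placed in `~/.Renviron`, which R reads at startup.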

Scrape text data from NY Times

Example: collecting data for longer periods takes more time. Here, collecting all NY Times articles for 2 months (2000-1 to 2000-2) takes about 5 minutes.

urls <- makeURL(begin_year = 2000, begin_month = 1, end_year = 2000, end_month = 2)
DF <- getDF(urls, Sys.getenv("NYTIMES_AS_KEY"))
# saveRDS(DF, "DF.rds")
api_data_sample <- DF
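
The Archive API serves one JSON document per calendar month, so a date range translates into one request URL per month. The sketch below illustrates that idea only; `build_archive_urls` is a hypothetical helper written for this example and is not the package's `makeURL`, whose internals are not shown here.

```r
# Hypothetical helper (not the package's makeURL): one Archive API URL per month
# in the inclusive range begin_year-begin_month .. end_year-end_month.
build_archive_urls <- function(begin_year, begin_month, end_year, end_month) {
  # Index months from year 0 so ranges that cross a year boundary stay simple.
  first <- begin_year * 12 + (begin_month - 1)
  last  <- end_year * 12 + (end_month - 1)
  stopifnot(first <= last)
  idx <- seq(first, last)
  years  <- idx %/% 12
  months <- idx %% 12 + 1
  sprintf("https://api.nytimes.com/svc/archive/v1/%d/%d.json", years, months)
}

urls_sketch <- build_archive_urls(1999, 12, 2000, 2)
print(urls_sketch)
```

Each request also needs the key as an `api-key` query parameter; with httr this could be passed as `httr::GET(url, query = list("api-key" = key))`.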

A glimpse of our scraped dataset

datatable(DF[1:5,])

Some summary statistics about the data

DF %>%
  select(pub_date, headline, type) %>%
  group_by(type) %>%
  count(sort = TRUE) %>%
  head(10)

Grouping text by "NEWS" and date

filtered <- DF %>%
  select(pub_date, headline, type) %>%
  filter(type == "News") %>%
  group_by(pub_date) %>%
  summarise(headline = paste(headline, collapse = " "))
filtered$pub_date <- as.Date(filtered$pub_date)
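
To see what the grouping step produces, here is a small sketch on made-up data. It uses base R `aggregate()` instead of dplyr so that it runs without extra packages; the toy headlines and dates are invented for illustration.

```r
# Toy data with the same three columns used above (values are made up).
toy <- data.frame(
  pub_date = c("2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"),
  headline = c("Markets open higher", "Tech shares rally",
               "Oil prices fall", "Review: a new play"),
  type     = c("News", "News", "News", "Review"),
  stringsAsFactors = FALSE
)

# Keep only "News" items, then paste all headlines of a day into one string.
news_only <- toy[toy$type == "News", ]
by_date <- aggregate(headline ~ pub_date, data = news_only,
                     FUN = function(x) paste(x, collapse = " "))
by_date$pub_date <- as.Date(by_date$pub_date)
print(by_date)
```

The result has one row per date, which is the shape the text-cleaning step expects: one combined document per day.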

Text Cleaning

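sotu_df <- filtered
sotu_clean <- string_cleaning(sotu_df$headline)

# Convert to a corpus, one document per publication date.
# Note: recent versions of tm expect DataframeSource() to receive a data frame
# with `doc_id` and `text` columns, so this call may need adapting.
sotu_corpus <- VCorpus(DataframeSource(cbind(sotu_df$pub_date,
                                             sotu_clean)))

# Stem words
sotu_stemmed <- tm_map(sotu_corpus, stemDocument, lazy = TRUE)

# Build the term-document matrix. The default weighting is raw term frequency;
# pass control = list(weighting = weightTfIdf) to get TF-IDF weights.
sotu_tdm <- TermDocumentMatrix(sotu_stemmed)

# Convert the TDM to a dense matrix, drop its first 60 rows (terms),
# and transpose so that each row corresponds to one date.
sotu_m <- as.matrix(sotu_tdm)
mat <- as.data.frame(t(sotu_m[-c(1:60), ]))

# Attach the publication date as the first column
final_data_sample <- cbind(sotu_df$pub_date, mat)
names(final_data_sample)[1] <- "pub_date"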
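
For readers unfamiliar with TF-IDF, the following is a minimal base R sketch of the textbook definition, tf × log(N / df), on three made-up documents. It is an illustration under that stated definition, not the package's pipeline; tm's `weightTfIdf` uses a related but not identical variant (different log base and, by default, normalized term frequencies).

```r
# Three made-up documents standing in for one day of headlines each.
docs <- c(d1 = "stocks rally as markets rise",
          d2 = "markets fall on oil fears",
          d3 = "oil prices rise")

# Lowercase and split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: rows = documents, columns = terms.
tf <- t(vapply(tokens,
               function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
               integer(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(N / number of documents containing the term).
doc_freq <- colSums(tf > 0)
idf <- log(nrow(tf) / doc_freq)

# Multiply each column (term) by its IDF weight.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

Terms that occur in few documents (such as "stocks") receive a larger weight than terms shared across documents (such as "markets").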

Check Datasets

head(api_data_sample)
head(final_data_sample)
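
Since the stated goal is to combine these text features with numerical stock data, here is a hedged sketch of how the two tables could be joined by date. All values, and the `stock_prices` data frame with its `date` and `close` columns, are made up for illustration; the package itself does not ship this step in the code shown above.

```r
# Made-up text features, shaped like final_data_sample (one row per date).
text_features <- data.frame(
  pub_date = as.Date(c("2000-01-03", "2000-01-04", "2000-01-05")),
  market   = c(0.41, 0.00, 0.41),
  oil      = c(0.00, 0.41, 0.41)
)

# Made-up daily closing prices (hypothetical column names).
stock_prices <- data.frame(
  date  = as.Date(c("2000-01-03", "2000-01-04", "2000-01-06")),
  close = c(100.5, 99.8, 101.2)
)

# Inner join on the date: only days present in both tables are kept.
merged <- merge(stock_prices, text_features, by.x = "date", by.y = "pub_date")
print(merged)
```

An inner join drops days without trading data (weekends, holidays) and days without news; passing `all.x = TRUE` to `merge()` would keep every trading day instead.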

References

https://developer.nytimes.com/archive_api.json

zj2213/NYTimesArticleStock documentation built on Jan. 6, 2021, 11:51 p.m.
