TenK
is an R package aimed at simplifying the collection of SEC 10-K annual reports. It contains the following features:
This document introduces basic usage of the TenK
package.
A copy of this documentation is available via R in PDF format. To view it, execute vignette("TenK")
in your R console.
TenK
can correctly scrape approximately 90% of all business descriptions. If any, issues are usually related to the following causes:
You may observe that words appear to be "squished" together, like so:
...holder(s).employeesas of december 28, 2014, we had 8,696 full-time employees, including 3,014 employees in research and development, 1,023 employees in sales and marketing, 639 employees in general and administrative, and 4,020 employees in operations. none of our employees is represented by a collective bargaining agreement and we have never experienced any employee work stoppage. we believe that our employee relations are good.10table of contentsexecutive officersour executive officers,..
This is due to the way rvest
extracts text from the page and mainly affects paragraph headers. I am looking for a way to fix this.
The main function in this package, TenK_process
, takes as its input a URL belonging to a 10-K report. The URL can point either to the FTP or the HTML version of the report. If the user passes an FTP url, then TenK_process
automatically determines the HTML version and collects useful metadata. If the user passes an HTML url, TenK_process
also collects metadata and returns the scraped text. Currently, TenK_process
either returns the full 10-K report, or the business description section.
The figure below schematically outlines this process
To install the TenK
package, execute:
# Install if not present if(!require(devtools)) install.packages("devtools") # Load library(devtools) # Install TenK from github install_github("JasperHG90/TenK")
You can then load the package as follows:
library(TenK)
For the years 2013-2016, TenK
provides datasets containing FTP urls for each 10-K filing. These can be queried as follows:
data("filings10K2013") # Exchange 2013 for 2014, 2015 or 2016 for later years. dim(filings10K2013)
For more information about these data, execute ?filings10K2013
in the R console.
To retrieve a report, use the TenK_process
function. It has the following parameters:
Retrieving all metadata plus the report is straightforward. This is demonstrated in the code block below.
# Retrieve business description res <- TenK_process(filings10K2013$ftp_url[1], retrieve = "BD") # Print print(names(res))
You can retrieve a similar result when using a direct HTML url:
# Retrieve business description res2 <- TenK_process(res$htm10kurl, retrieve = "BD") # Print print(names(res2))
As you can see, this query does not return the 'FTPurl' field.
The names of these results correspond to the following metadata:
Variable Description Optional
CIK Central Index Key (CIK) of the company. CIK numbers are unique No identifiers that the SEC assigns to all entities and individuals that file disclosure documents.
ARC SEC accession number. The accession number is a unique number No that EDGAR assigns to each submission as the submission is received. You cannot use accession numbers to filter for types of filings.
index.url Company filings index url. For an example, see: Yes this url
company_ Name of the company Yes name
filing_ Date on which the report was filed to the SEC Yes date
date_ Date/time on which the SEC accepted the report Yes accepted
period_ Fiscal year to which the report belongs. Yes report
htm10kurl URL pointing to the HTML version of the report. Yes
htm10kinfo Meta information about the HTML version of the report. Yes Contains file name, report type, file size and file extension
FTPurl URL pointing to the FTP version of the report No
report Either the business description or the full report No
Optional fields can be manually selected/deselected.
If you want to select optional metadata fields, you can do so by passing a list to the 'meta_list' parameter. This is demonstrated in the code block below.
# Retrieve business description - select metadata fields res_select <- TenK_process(filings10K2013$ftp_url[1], meta_list = list("filing_date" = 1, "date_accepted" = 1, "htm10kinfo" = 1), retrieve = "BD") print(names(res_select))
If you don't desire any metadata, you can turn this off by setting the 'metadata' parameter to FALSE:
# Retrieve business description without metadata res_no_metadata <- TenK_process(filings10K2013$ftp_url[1], metadata = FALSE) # Print print(names(res_no_metadata))
There are several ways in which you can store the results of the TenK_process
function. Here, I'll outline 4 ways to do this.
You can save R lists (this is what TenK_process
returns) as JSON files:
library(rjson) # Install if you don't have it # Convenience function savetojson <- function(data, path) { # Write g <- toJSON(data) write(g, paste0(path, "/data.json")) } # Run savetojson(res, "/users/jasper/desktop")
You can load the data as follows:
library(rjson) res <- fromJSON(file = "/users/jasper/desktop/data.json")
Postgresql is a stable, fast and flexible SQL database. Unlike MySQL, it is able to store large text files and is capable of storing terabytes of data.
After installing postgresql, you can use the RPostgreSQL
package to store and retrieve data.
The first step is to create a table with field names and data types. The example below does this for all metadata. In your own case, you may want to delete some of these fields if you don't require them.
library(RPostgreSQL) # Open connection db <- dbConnect(PostgreSQL(), user = "Jasper", host = "127.0.0.1") # Drop db dbSendQuery(db, "DROP TABLE IF EXISTS tenk_reports ;") # Disconnect dbDisconnect(db)
# Install if you don't have this package library(RPostgreSQL) # Open connection db <- dbConnect(PostgreSQL(), user = "Jasper", host = "127.0.0.1") # Create table q <- dbSendQuery(db, "CREATE TABLE tenk_reports ( cik integer, arc bigint, index_url VARCHAR(150), company_name VARCHAR(150), filing_date date, date_accepted TIMESTAMP, period_report date, htm10kurl VARCHAR(150), htm10kinfo_10K_url VARCHAR(100), htm10kinfo_type CHAR(4), htm10kinfo_size integer, htm10kinfo_extension CHAR(3), ftpurl VARCHAR(150), report TEXT );")
Once you've created your postgres table, you can append data to it by using the dbWriteTable
function:
# Add data - easiest way is to convert list to df res_df <- as.data.frame(res, stringsAsFactors = F) # This gives a df with 14 columns and 1 row dim(res_df) # Replace '.' with '_' and lowercase names names(res_df) <- gsub("\\.", "_", names(res_df)) names(res_df) <- tolower(names(res_df)) # Write dbWriteTable(db, "tenk_reports", res_df, append=T, row.names = F)
The htm10kurl
effectively functions as a unique ID for each report. As such, it is convenient to use it as a way to check if a given record already exists in a table:
dbCheck <- function(conn, URL) { # Query q <- dbSendQuery(db, paste0("SELECT exists (SELECT 1 FROM tenk_reports WHERE htm10kurl = ", "'", URL, "'", " LIMIT 1);")) # Fetch f <- fetch(q) # Clear result dbClearResult(q) # Return exists return( f$exists ) }
Before you store the record, you can run the check
# Run check check <- dbCheck(db, res$htm10kurl) # Print print(check)
If the function returns TRUE, the record already exists. If it returns FALSE, you can go ahead and store the record.
To query data from the database, you can use dbReadTable
:
res_query <- dbReadTable(db, "tenk_reports") # Disconnect dbDisconnect(db)
As you can see, the data types (which we set when creating the table) are also imported into R:
str(res_query, nchar.max = 10)
Mongodb is a NoSQL database that excels at storing large documents and unstructured data (e.g. not column/row pairs).
After installing MongoDB on your system, you can send and load data using the rmongodb
package.
You don't need to explicitly state that you want to create a mongodb database; rather, you would just start using it ad hoc. Note that with mongodb, a namespace is a combination of the database and the collection (similar to SQL table).
library(rmongodb) # install if you don't have it # Details database <- "tenk_reports" col <- "records" ns <- paste0(database,".",col) print(ns) # Create mongo connection m <- mongo.create()
To store a record in the database, you can use mongo.insert
:
# Insert data mongo.insert(m, ns, res)
Once again, it is a good idea to check if the record already exists in the database. You can do this as follows:
recExists <- function(mongo_connection, ns, URL) { # Find q <- mongo.find.all(mongo_connection, ns, query = list("htm10kurl" = URL)) # If len >0 , return TRUE, else return FALSE ifelse( length(q) > 0, return(TRUE), return(FALSE)) }
You can then call it like this:
ex <- recExists(m, ns, res$htm10kurl) print(ex)
To retrieve a record, you can use mongo.find.one()
or mongo.find.all()
# Find one record res_mongo <- mongo.find.one(m, ns, query = list("htm10kurl" = res$htm10kurl)) # Note that the result is a mongo BSON class(res_mongo) # We can turn it into an R list like this res_mongo_list <- mongo.bson.to.list(res_mongo) # We can also query all records at once - note that these get converted to a list immediately res_mongo <- mongo.find.all(m, ns) # Disconnect mongo.destroy(m)
# Drop mongo collection mongo <- mongo.create() mongo.drop.database(mongo, "tenk_reports") mongo.destroy(mongo)
Note that, unlike with postgresql, the data does not automatically have the right data type. This is a drawback of schema-less databases like mongo.
str(res_mongo, nchar.max = 10)
An Rdata file is a flexible and secure way to store R objects in a highly compressed file on disk. You can save your results as follows:
# Save to Rdata save(results, file="/users/jasper/desktop/results.Rdata")
To load the data, execute the following:
# Load load(file="/users/jasper/desktop/results.Rdata")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.