Sometimes we run into Medium Data: data that are too big to work with comfortably on our own computers, but still small enough to handle with some special tools.
Reading raw CSV files larger than 500MB can take some time, and the entire dataset must be stored in memory. There are a few ways to speed this up, each with its own trade-offs.
If the data you need loaded at a given point are small enough to fit in RAM, then `data.table` is a great package. `data.table` includes `fread`, which provides much faster read times and lets you select specific columns rather than reading the entire data set.
``` {r eval = FALSE}
library(data.table)
# Read only columns 1 and 4, and only the first 100,000 rows
fread("", select = c(1, 4), nrows = 100000)
```

Combined with **sampling**, you can probably do much of the exploratory analysis without ever loading the full data set.

#### Selectable

If the data are too big to hold and analyze in RAM, or you want more specific selection criteria than `fread` offers when reading a file, then either flat files or databases work well. [SQLite](https://sqlite.org/) is a simple, free database that is easy to use. Importing a csv file into SQLite is as simple as writing an sqlite script.
```
.mode csv
.headers on
.import "
```

Then with `dplyr` you can select data from the SQLite database easily:

``` {r eval = FALSE}
library(dplyr)
db <- dplyr::src_sqlite("path/to/sqlite_database")
# Filter inside the database, then pull only the matching rows into memory
df <- tbl(db, "table_name") %>%
  filter(year > 2000, income > 40000) %>%
  collect()
```
This pulls only the selected subset into memory rather than the whole table. There are numerous SQL database options, and there is much more that you can do with `dplyr`'s interface for selecting, sorting, and summarizing data.
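As a rough, self-contained sketch of that workflow, the example below loads a tiny made-up table into an in-memory SQLite database from R and then filters, sorts, and summarizes it before collecting the result. It assumes the `DBI`, `RSQLite`, `dplyr`, and `dbplyr` packages are installed. The table name `people` and its values are hypothetical, and `dbWriteTable()` stands in for the sqlite import script shown earlier.

```r
library(DBI)
library(dplyr)

# Hypothetical data; in practice this would be the imported csv file
people <- data.frame(
  year   = c(1995, 2005, 2010, 2015),
  income = c(30000, 45000, 38000, 60000)
)

# An in-memory database keeps the sketch self-contained
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "people", people)

# Selecting and sorting happen in SQLite; collect() brings the rows into R
result <- tbl(con, "people") %>%
  filter(year > 2000, income > 40000) %>%
  arrange(desc(income)) %>%
  collect()

# Summaries can also be computed in the database before collecting
income_summary <- tbl(con, "people") %>%
  summarise(n_rows = n(), mean_income = mean(income, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```

Only `result` and `income_summary` are ever held in R's memory here; the full table stays in the database.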
Another option is the `ff` and `bit` packages, which keep the data in files on the hard drive and allow much of the same indexing and selecting behavior as `data.frames`. They make extensive use of chunking. A csv file can be read into an ffdf object and saved to ffdf files by:
``` {r eval = FALSE}
library(ff)
test <- read.csv.ffdf(file = "
ffsave(test,
       file = "
```

The data can be indexed and sorted without loading everything into memory, or read into memory just in the sections that you want:

``` {r eval = FALSE}
ffload("data/test", overwrite = TRUE, rootpath = "data/")
# Randomly sort the data
N <- nrow(test)
s <- runif(N)
# Indexing an ffdf returns a data.frame; as.ffdf() converts it back to an ffdf
test1 <- as.ffdf(test[order(s), c("col1", "col2", "col3")])
# Read ffdf formatted data into memory:
dat <- test[1:100000, ] # is a data.frame
# Processing can happen in chunks:
library(bit)
for (i in chunk(from = 1, to = N, by = 10000)) {
  # Do some processing to the data
}
```
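To make the chunking idea concrete without depending on `ff`, here is a minimal base R sketch that reads a csv file block by block through an open connection and keeps a running total. This is a stand-in for the `ff` loop above, not the `ff` API itself. The file, the column names `id` and `value`, and the chunk size are all made up for illustration.

```r
# Hypothetical data written to a temporary csv file
csv_path <- tempfile(fileext = ".csv")
n <- 25000
write.csv(data.frame(id = seq_len(n), value = seq_len(n) %% 100),
          csv_path, row.names = FALSE)

chunk_size <- 10000
col_names <- names(read.csv(csv_path, nrows = 1))

con <- file(csv_path, open = "r")
invisible(readLines(con, n = 1))  # skip the header line

total <- 0
rows_seen <- 0
repeat {
  # read.csv() continues from the current position of an open connection;
  # it raises an error once no lines are left, which ends the loop
  block <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL
  )
  if (is.null(block) || nrow(block) == 0) break
  total <- total + sum(block$value)      # do some processing on the chunk
  rows_seen <- rows_seen + nrow(block)
  if (nrow(block) < chunk_size) break    # a short block means end of file
}
close(con)

mean_value <- total / rows_seen
```

At no point does more than `chunk_size` rows sit in memory, which is the same trade-off the `ff` loop makes.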
Sampling can be hugely helpful, and you can benchmark the selection yourself.
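One way to run such a benchmark is with `system.time()`. The sketch below uses only base R and a synthetic csv file (the column names and sizes are hypothetical). It compares reading the whole file and sampling in memory against reading just a slice of rows and columns. Timings will vary by machine and file, so none are shown.

```r
set.seed(42)

# Synthetic data so the sketch is self-contained
n <- 50000
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id     = seq_len(n),
             year   = sample(1990:2020, n, replace = TRUE),
             income = round(runif(n, 1e4, 1e5))),
  csv_path, row.names = FALSE
)

# Option 1: read everything, then take a random sample of 1000 rows
t_full <- system.time({
  full <- read.csv(csv_path)
  samp_full <- full[sample(nrow(full), 1000), ]
})

# Option 2: read only the first 1000 rows and skip the `year` column.
# Note that this is a slice of the file, not a random sample.
t_head <- system.time({
  head_only <- read.csv(csv_path, nrows = 1000,
                        colClasses = c("integer", "NULL", "numeric"))
})

# Compare elapsed times for the two approaches
print(c(full = t_full[["elapsed"]], head = t_head[["elapsed"]]))
```

The same pattern works for comparing `fread`, a database query, or an `ff` lookup on your own data.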
Vignettes are long form documentation commonly included in packages. Because they are part of the distribution of the package, they need to be as compact as possible. The `html_vignette` output type provides a custom style sheet (and tweaks some options) to ensure that the resulting html is as small as possible. The `html_vignette` format:
Note the various macros within the `vignette` section of the metadata block above. These are required in order to instruct R how to build the vignette. Note that you should change the `title` field and the `\VignetteIndexEntry` to match the title of your vignette.
The `html_vignette` template includes a basic CSS theme. To override this theme you can specify your own CSS in the document metadata as follows:

    output:
      rmarkdown::html_vignette:
        css: mystyles.css
The figure sizes have been customised so that you can easily put two images side-by-side.

    plot(1:10)
    plot(10:1)
You can enable figure captions by `fig_caption: yes` in YAML:

    output:
      rmarkdown::html_vignette:
        fig_caption: yes

Then you can use the chunk option `fig.cap = "Your figure caption."` in **knitr**.
You can write math expressions, e.g. $Y = X\beta + \epsilon$, footnotes^[A footnote here.], and tables, e.g. using `knitr::kable()`.

    knitr::kable(head(mtcars, 10))
Also a quote using `>`:

> "He who gives up [code] safety for [code] speed deserves neither." (via)