knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages(library(disk.frame))

if(interactive()) {
  setup_disk.frame() 
} else {
  # only use 1 worker to pass CRAN check
  setup_disk.frame(1)
}

GLMs

Prerequisites

This article assumes you are familiar with Generalized Linear Models (GLMs). You should also have a basic working knowledge of {disk.frame}; see the {disk.frame} Quick Start.

Introduction

One can fit a GLM using the glm function. For example,

m = glm(dist ~ speed, data = cars)

fits a linear model on the data cars with dist as the target and speed as the explanatory variable (glm defaults to the Gaussian family). You can inspect the results of the model fit using

summary(m)

or if you have {broom} installed

broom::tidy(m)
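
Because glm uses the Gaussian family with an identity link when no family is given, the fit above is the same as an ordinary least-squares fit. The following is a small sketch, using only base R and the built-in cars data, that checks this; the object names m_glm and m_lm are chosen here for illustration.

```r
# glm() with its default family (gaussian, identity link)
m_glm <- glm(dist ~ speed, data = cars)

# the corresponding ordinary least-squares fit
m_lm <- lm(dist ~ speed, data = cars)

# the family actually used by glm()
family(m_glm)$family

# the estimated coefficients agree up to numerical tolerance
all.equal(coef(m_glm), coef(m_lm))
```

To fit a different kind of GLM, pass a family explicitly, for example family = binomial() for logistic regression.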

With {disk.frame}, you can fit a GLM using the dfglm function, where the df stands for disk.frame, of course!

cars.df = as.disk.frame(cars)

m = dfglm(dist ~ speed, cars.df)

summary(m)


# broom doesn't work in versions < R 3.6 because biglm does not work there,
# so only call tidy() on R 3.6 or later (including R 4.x)
if (getRversion() >= "3.6.0") {
  broom::tidy(m)
}

The syntax didn't change at all! You are able to enjoy the benefits of disk.frame when dealing with larger-than-RAM data.

Logistic regression

Logistic regression is one of the most commonly deployed machine learning (ML) models. It is often used to build binary classification models.

iris.df = as.disk.frame(iris)

# fit a logistic regression model to predict Species == "versicolor" using all other variables
all_terms_except_species = setdiff(names(iris.df), "Species")
formula_rhs = paste0(all_terms_except_species, collapse = "+")

formula = as.formula(paste("Species == 'versicolor' ~ ", formula_rhs))

iris_model = dfglm(formula , data = iris.df, family=binomial())

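# The same model with the formula written out explicitly:
# iris_model = dfglm(Species == "versicolor" ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris.df, family = binomial())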

summary(iris_model)

# broom doesn't work in versions < R 3.6 because biglm does not work there,
# so only call tidy() on R 3.6 or later (including R 4.x)
if (getRversion() >= "3.6.0") {
  broom::tidy(iris_model)
}

The arguments to the dfglm function are the same as the arguments to biglm::bigglm, which are in turn based on the glm function. Please check their documentation for other argument options.
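
When the data fit in memory, as iris does, the logistic regression above can be cross-checked against base glm. The sketch below uses only base R and does not call dfglm, so it says nothing about how {disk.frame} fits the model internally. The names predictors, f, ref_model and p are illustrative, and reformulate is used here as an alternative to pasting the formula together by hand.

```r
# all columns except the response are used as predictors
predictors <- setdiff(names(iris), "Species")

# build `Species == "versicolor" ~ Sepal.Length + ...` without string pasting
f <- reformulate(predictors, response = quote(Species == "versicolor"))

# in-memory reference fit with base glm
ref_model <- glm(f, data = iris, family = binomial())

# predicted probability that each flower is versicolor
p <- predict(ref_model, type = "response")

coef(ref_model)
summary(p)
```

The coefficients from this reference fit can then be compared by eye with the output of summary(iris_model).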

Notes

{disk.frame} uses {biglm} and {speedglm} as the backends for GLMs. Unfortunately, neither package is managed on an open-source platform, so it is more difficult to contribute to them by fixing bugs or submitting bug reports. As a result, bugs are likely to persist. There is an active effort in disk.frame to look for alternatives. Examples of avenues to explore include tighter integration with {keras}, h2o, or Julia's OnlineStats.jl for model fitting.

Another package for larger-than-RAM GLM fitting, {bigFastlm}, has been taken off CRAN; it is managed on GitHub.

Currently, parallel processing of GLM fits is not possible with {disk.frame}.



xiaodaigh/disk.frame documentation built on Feb. 3, 2023, 10:04 p.m.