Other functions in bulkreadr

knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE, 
  warning = FALSE,
  comment = "#>",
  fig.path = "man/figures/",
  out.width = "100%")

options(tibble.print_min = 5, tibble.print_max = 5)

options(rmarkdown.html_vignette.check_title = FALSE)

Beyond bulk data reading, the bulkreadr package provides helper functions for common data-analysis tasks. These functions operate on individual vectors, except for inspect_na() and fill_missing_values(), which work on data frames.

pull_out()

pull_out() is similar to [. It extracts or replaces parts of vectors, matrices, arrays, and lists, and it works well with both the magrittr (%>%) and base R (|>) pipe operators.

library(bulkreadr)
library(dplyr)

top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")

top_10_richest_nig %>% 
  pull_out(c(1, 5, 2))
top_10_richest_nig %>% 
  pull_out(-c(1, 5, 2))
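
Since pull_out() is described as similar to [, its behaviour can be pictured with plain bracket indexing. The sketch below is not bulkreadr's implementation: pull_out_sketch() is a hypothetical stand-in that is assumed to simply forward its arguments to [. It needs R 4.1 or later for the native pipe.

```r
# Hypothetical stand-in for illustration only; not bulkreadr's source code.
pull_out_sketch <- function(x, ...) x[...]

letters_5 <- c("a", "b", "c", "d", "e")

# Positive indices select elements in the order given.
picked <- letters_5 |> pull_out_sketch(c(1, 5, 2))

# Negative indices drop those positions and keep the rest.
dropped <- letters_5 |> pull_out_sketch(-c(1, 5, 2))

# Both results match plain bracket indexing.
identical(picked, letters_5[c(1, 5, 2)])
identical(dropped, letters_5[-c(1, 5, 2)])
```

Wrapping the extraction in an ordinary function is what lets it sit at the end of a pipeline without backticks or anonymous functions.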

convert_to_date()

convert_to_date() parses an input vector into a POSIXct date-time object. It can also convert Excel serial date numbers such as 42370 into date values such as 2016-01-01.

## ** heterogeneous dates **

dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
  "13-01-  2022", "2023", "2023-2", 41750.2, 41751.99,
  "11 07 2023", "2023-4"
  )

# Convert to POSIXct or Date object

convert_to_date(dates)

# It can also convert a date-time object to a date object

convert_to_date(lubridate::now())
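
For readers wondering how a number like 42370 becomes 2016-01-01, the following base R sketch shows the usual arithmetic. It assumes the Windows 1900 Excel date system, for which the conventional origin in R is 1899-12-30; how convert_to_date() handles this internally is not shown here.

```r
# Assumption: serial numbers come from Excel's 1900 date system,
# for which "1899-12-30" is the conventional origin in R.
excel_origin <- "1899-12-30"

serials <- c(42370, 44869)

# Each serial number is a count of days since the origin.
as_dates <- as.Date(serials, origin = excel_origin)
format(as_dates[1])
#> [1] "2016-01-01"

# A date-time can be reduced to its calendar date in the same spirit.
stamp <- as.POSIXct("2023-07-11 15:30:00", tz = "UTC")
stamp_date <- as.Date(stamp, tz = "UTC")
```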

inspect_na()

inspect_na() summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.

# dataframe summary

inspect_na(airquality)

# Grouped dataframe summary

airquality %>% 
  group_by(Month) %>% 
  inspect_na()
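
The rate of missingness is simply the share of NA entries in a column. As a rough base R sketch of that idea (not the output format of inspect_na(), and using a made-up toy data frame), the overall and per-group rates can be computed like this:

```r
# Toy data, invented for illustration.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  g = c("u", "u", "v", "v")
)

# is.na() gives a logical matrix; column means are the NA proportions.
na_rate <- colMeans(is.na(toy))

# Per-group rates: split the rows by g, then repeat the same calculation.
by_group <- sapply(
  split(toy[c("a", "b")], toy$g),
  function(d) colMeans(is.na(d))
)
by_group
```

Here by_group is a matrix with one row per column of toy and one column per group.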

fill_missing_values()

fill_missing_values() fills in missing values in a data frame using imputation by function, also known as column-based imputation: each missing value is replaced by a summary statistic computed from the observed values in the same column. For continuous variables, the supported methods are the minimum, maximum, mean, median, harmonic mean, and geometric mean. For categorical variables, missing values are replaced with the mode of the column. Because every replacement is derived from its own column, the imputed columns end up complete and consistent with the values already observed.
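
To make these methods concrete, here is a minimal base R sketch of the statistics involved. It is an illustration under simple assumptions, not bulkreadr's code; in particular, breaking ties in the mode by taking the first value is a choice made here, and the package may behave differently.

```r
# A numeric column with one missing value (made-up data).
x <- c(1, 2, 4, NA)
obs <- x[!is.na(x)]

# Candidate fill values for a continuous variable.
fills <- c(
  mean      = mean(obs),
  median    = median(obs),
  geometric = exp(mean(log(obs))),      # assumes strictly positive values
  harmonic  = length(obs) / sum(1 / obs)
)
fills

# Mode imputation for a categorical variable.
species <- c("setosa", NA, "versicolor", "setosa")
counts <- table(species)                 # table() ignores NA by default
mode_value <- names(counts)[which.max(counts)]
species[is.na(species)] <- mode_value
species
```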

df <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c("setosa", NA, "versicolor", "setosa",
    NA, "virginica", "setosa"
  )
)
df

Impute using the mean method for continuous variables

result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean

Impute using the geometric mean for continuous variables and specify variables Petal_Length and Petal_Width

result_df_geomean <- fill_missing_values(
  df,
  selected_variables = c("Petal_Length", "Petal_Width"),
  method = "geometric"
)

result_df_geomean

Impute missing values (NAs) in a grouped data frame

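You can apply fill_missing_values() to a grouped data frame by splitting it into one data frame per group and mapping the function over the pieces, for example with dplyr's group_split() and purrr's map_df(). Here is an example of how to do this: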

library(purrr) # provides map_df()

sample_iris <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c("setosa", "setosa", "versicolor", "setosa",
              "virginica", "virginica", "setosa")
)
sample_iris

# Split by species, impute each group separately, then row-bind the results
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  map_df(fill_missing_values, method = "median")
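
For comparison, the same group-wise idea can be sketched in base R with ave(), which applies a function within each level of a grouping vector. This is an alternative illustration rather than what fill_missing_values() does internally; impute_median() is a hypothetical helper defined only for this example.

```r
sample_df <- data.frame(
  Species = c("setosa", "setosa", "versicolor", "setosa",
              "virginica", "virginica", "setosa"),
  Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA)
)

# Hypothetical helper: replace NAs with the median of the observed values.
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# ave() runs impute_median() separately within each species.
sample_df$Petal_Width <- ave(
  sample_df$Petal_Width,
  sample_df$Species,
  FUN = impute_median
)
sample_df
```

The missing setosa value is filled from the other setosa rows only, so values from versicolor and virginica do not influence it.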



bulkreadr documentation built on June 8, 2025, 9:36 p.m.