The bulkreadr package in R includes specialized functions beyond bulk data reading, aimed at enhancing data analysis efficiency. These functions are designed to operate on individual vectors, except for inspect_na() and fill_missing_values(), which work on data frames.
pull_out() is similar to [. It acts on vectors, matrices, arrays and lists to extract or replace parts. It works well with the magrittr (%>%) and base (|>) pipe operators.
library(bulkreadr)
library(dplyr)

top_10_richest_nig <- c(
  "Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze",
  "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor",
  "Jimoh Ibrahim", "Tony Elumelu"
)

# Extract the 1st, 5th and 2nd elements, in that order
top_10_richest_nig %>%
  pull_out(c(1, 5, 2))

# Drop the 1st, 5th and 2nd elements
top_10_richest_nig %>%
  pull_out(-c(1, 5, 2))
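Since pull_out() is described as similar to [, the same selections can be written with plain base R indexing. The sketch below is only a comparison point, assuming pull_out() mirrors the positional semantics of [; the shortened vector `richest` and the replacement value are made up for illustration.

```r
# Illustrative vector (a shortened copy of the example above)
richest <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola",
             "Arthur Eze", "Abdulsamad Rabiu")

# Positive indices extract elements in the order given
picked <- richest[c(1, 5, 2)]
picked

# Negative indices drop the listed positions and keep the rest
dropped <- richest[-c(1, 5, 2)]
dropped

# `[<-` replaces parts of a copy without touching the original
updated <- richest
updated[2] <- "M. Adenuga"  # hypothetical replacement value
updated
```

A function form such as pull_out() is convenient mainly because it reads naturally as a step in a pipeline.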
convert_to_date() parses an input vector into a POSIXct date-time object. It can also convert Excel serial date numbers such as 42370 into date values such as 2016-01-01.
## ** heterogeneous dates **
dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022", "13-01- 2022",
  "2023", "2023-2", 41750.2, 41751.99, "11 07 2023", "2023-4"
)

# Convert to POSIXct or Date object
convert_to_date(dates)

# It can also convert date time object to date object
convert_to_date(lubridate::now())
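The Excel example (42370 becoming 2016-01-01) follows from how Excel stores dates: as a count of days from a fixed origin. The base R sketch below shows only that arithmetic, assuming the usual Windows Excel origin of 1899-12-30; it is not convert_to_date()'s implementation, and the variable names are illustrative.

```r
# Excel serial date numbers count days since 1899-12-30 (Windows Excel origin)
excel_serial <- 42370
excel_date <- as.Date(excel_serial, origin = "1899-12-30")
excel_date
#> [1] "2016-01-01"

# A fractional serial carries a time of day; floor() keeps the whole day only
excel_day_only <- as.Date(floor(41750.2), origin = "1899-12-30")
excel_day_only
```

The advantage of a helper like convert_to_date() is that the serial numbers and the many text formats can sit in the same input vector.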
inspect_na() summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.
# dataframe summary
inspect_na(airquality)

# Grouped dataframe summary
airquality %>%
  group_by(Month) %>%
  inspect_na()
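To make "rate of missingness" concrete, the same quantity can be computed by hand in base R. This is a rough sketch of the idea only; the layout of inspect_na()'s output is not reproduced here, and the object names are illustrative.

```r
# Proportion of missing values per column (airquality ships with base R)
na_rate <- colMeans(is.na(airquality))
na_rate

# The same rate computed separately for each Month:
# rows are columns of airquality, columns are months
na_rate_by_month <- sapply(
  split(airquality, airquality$Month),
  function(d) colMeans(is.na(d))
)
na_rate_by_month
```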
fill_missing_values() fills missing values in a data frame using imputation by function, also known as column-based imputation: each missing value is replaced by a summary statistic computed from the observed values in the same column. For continuous variables, the supported methods are minimum, maximum, mean, median, harmonic mean, and geometric mean. For categorical variables, missing values are replaced with the mode of the column. Because every replacement is derived from its own column, the result is a complete dataset whose imputed values stay consistent with the observed values of each column.
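The idea of column-based imputation can be sketched in a few lines of base R. The helpers below (`geometric_mean`, `harmonic_mean`, `stat_mode`, `impute_with`) are hypothetical names written for this illustration and use the textbook definitions of each statistic; they are not bulkreadr's internal code, and the package may handle edge cases differently.

```r
# Hypothetical helpers for illustration only (not bulkreadr internals)

# Geometric mean: exp of the mean of logs; assumes strictly positive values
geometric_mean <- function(x) exp(mean(log(x), na.rm = TRUE))

# Harmonic mean: n divided by the sum of reciprocals of the observed values
harmonic_mean <- function(x) {
  x <- x[!is.na(x)]
  length(x) / sum(1 / x)
}

# Mode: most frequent observed value (first one wins on ties)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace the NAs in a vector with a statistic of its observed values
impute_with <- function(x, f) {
  x[is.na(x)] <- f(x)
  x
}

petal_width <- c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA)
species <- c("setosa", NA, "versicolor", "setosa", NA, "virginica", "setosa")

width_mean    <- impute_with(petal_width, function(x) mean(x, na.rm = TRUE))
width_geomean <- impute_with(petal_width, geometric_mean)
species_mode  <- impute_with(species, stat_mode)

width_mean
width_geomean
species_mode
```

Each column is handled independently, which is why a numeric statistic suits continuous columns while the mode is the natural choice for categorical ones.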
df <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c("setosa", NA, "versicolor", "setosa", NA, "virginica", "setosa")
)

df
If you do not specify selected_variables (i.e., leave it as NULL), the function imputes missing values in all columns of the data frame.
# Impute using the mean
result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean
If you specify column names, only those columns are imputed. For example, impute the variables Petal_Length and Petal_Width using the geometric mean:
# Impute using the geometric mean
result_df_geomean <- fill_missing_values(
  df,
  selected_variables = c("Petal_Length", "Petal_Width"),
  method = "geometric"
)

result_df_geomean
If you specify column positions, only the columns at those positions will be imputed.
# Impute using the maximum method
result_df_max <- fill_missing_values(
  df,
  selected_variables = c(2, 3),
  method = "max"
)

result_df_max
You can apply fill_missing_values() to a grouped data frame by combining it with grouping and map functions. Here is an example:
sample_iris <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c("setosa", "setosa", "versicolor", "setosa", "virginica", "virginica", "setosa")
)

sample_iris
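# Split by Species, impute each piece with its own median, then row-bind.
# map_df() comes from purrr, which is not attached above, so it is namespaced.
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  purrr::map_df(fill_missing_values, method = "median")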
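For comparison, per-group imputation of a single column can also be sketched in base R with ave(), which applies a function within each group and returns the values in the original row order. This is an alternative illustration, not how fill_missing_values() works internally; `iris_small` and `fill_median` are hypothetical names.

```r
# A reduced copy of the example data (one numeric column plus the group)
iris_small <- data.frame(
  Species = c("setosa", "setosa", "versicolor", "setosa",
              "virginica", "virginica", "setosa"),
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5)
)

# Hypothetical helper: replace NAs with the median of the observed values
fill_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# ave() applies fill_median within each Species and keeps row order
iris_small$Sepal_Length <- ave(
  iris_small$Sepal_Length,
  iris_small$Species,
  FUN = fill_median
)

iris_small
```

One practical difference: the split-and-bind approach returns rows ordered by group, while ave() leaves rows where they were.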