The DHSr package offers advanced tools designed for repeated regression analysis, particularly well-suited for demographic and health survey datasets, though it can be applied to other datasets as well. It includes capabilities for compound statistical analyses using Stein’s estimate and features a unique approach to spatial data analysis, enabling the identification of clusters within hot-hot and cold-cold regions. The package also includes a basic map visualization feature, primarily intended for visualizing calculations. For users seeking more control over mapping, the derived variables from the package can be used with other mapping packages for enhanced visualization options
You can install the released version of DHSr
from CRAN with:
install.packages(“DHSr”)
The Replm2
function allows you to run linear regressions across all
locations in a dataset. Below is a basic example using a simple formula.
here are the parameters :
data
: The dataset containing the variables you want to analyze.
formula
:This specifies the model you want to fit, using standard R
formula syntax (e.g., years_education ~ gender_female +
household_wealth).
location_var
: The name of the variable in your dataset that identifies
different locations or groups (e.g., district_code). The function will
run a separate regression for each unique value in this variable.This
parameter is mandatory. If the data is not spatial data, but if the data
has other grouping varible that also can be passed here to perform
repeated regression for each group
response_distribution
: Specifies the type of regression model to fit.
Commonly, “normal” is used for linear regression. For logistic or other
types of models, you would specify an appropriate distribution.
family
: (Optional) Used when you are performing a generalized linear
model (GLM), such as logistic regression. For linear regression with a
normal distribution, this can be set to NULL.
library(DHSr)
#> DHSr package loaded with all dependencies.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(123)
dummy_data <- data.frame(
years_education = rnorm(100, 12, 3), # Represents years of education
gender_female = rbinom(100, 1, 0.5), # 1 = Female, 0 = Male
household_wealth = sample(1:5, 100, replace = TRUE), # Wealth index from 1 to 5
district_code = sample(1:10, 100, replace = TRUE) # Represents district codes
)%>% arrange(district_code)
shapefile_path <- "C:/Users/91911/Documents/India%3A_State_Boundary_2021_.shp"
# Define a simple regression formula
formula <- years_education ~ gender_female + household_wealth + household_wealth:gender_female
# Run the regression across all locations (districts)
results1 <- Replm2(dummy_data, formula, "district_code", "normal")
# Print the results
print(head(results1))
#> # A tibble: 6 × 16
#> location `estimate_(Intercept)` estimate_gender_female estimate_household_we…¹
#> <int> <dbl> <dbl> <dbl>
#> 1 1 22.4 -11.6 -2.81
#> 2 2 13.0 2.29 -0.229
#> 3 3 12.4 1.89 -0.389
#> 4 4 12.1 -0.116 -0.0620
#> 5 5 19.5 -3.46 -2.00
#> 6 6 11.3 1.67 0.517
#> # ℹ abbreviated name: ¹estimate_household_wealth
#> # ℹ 12 more variables: `estimate_gender_female:household_wealth` <dbl>,
#> # `estimate_R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> # std_error_gender_female <dbl>, std_error_household_wealth <dbl>,
#> # `std_error_gender_female:household_wealth` <dbl>,
#> # `std_error_R-squared` <dbl>, `p_value_(Intercept)` <dbl>,
#> # p_value_gender_female <dbl>, p_value_household_wealth <dbl>, …
The Replmre2
function is designed to run mixed-effects models across
all locations defined by a categorical variable in your dataset. It
supports linear mixed-effects models (LMEMs), making it suitable for
continuous response variables and normally distributed errors.
random_effect_var
: Specifies the grouping variable for random
effects. Note : one common issue ‘singularity error’ occurs with complex
random effects; to avoid try dropping that particular location from the
data.
set.seed(123)
# Create HHid (Household ID), grouping every 3-4 records, and convert to character
dummy_data$HHid <- as.character(rep(1:20, each = 5, length.out = nrow(dummy_data)))
# Test the function with full_rank dataset
data <- dummy_data
# Define a simple regression formula
formula <- years_education ~ gender_female + household_wealth:gender_female
location_var <- "district_code"
random_effect_var <- "HHid"
results <- DHSr::Replmre2(data, formula, location_var, random_effect_var)
#> Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
print(head(results))
#> # A tibble: 6 × 16
#> location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
#> <int> <dbl> <dbl> <dbl>
#> 1 1 12.6 -1.73 0.522
#> 2 2 12.3 3.03 -1.08
#> 3 3 11.1 3.27 -0.110
#> 4 4 11.9 0.0698 0.124
#> 5 5 12.0 4.04 -1.06
#> 6 6 12.9 0.906 -0.660
#> # ℹ abbreviated name: ¹`estimate_gender_female:household_wealth`
#> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
#> # `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> # std_error_gender_female <dbl>,
#> # `std_error_gender_female:household_wealth` <dbl>,
#> # `std_error_Marginal R-squared` <dbl>,
#> # `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …
Repglmre2
: Extends Replmre2 to non-normal regressions like logistic
or Poisson
# Create a binary outcome variable for years of education
dummy_data$education_binary <- ifelse(dummy_data$years_education > 11, 1, 0)
# Define a logistic regression formula
formula <- education_binary ~ gender_female + household_wealth:gender_female
location_var <- "district_code"
random_effect_var <- "HHid"
# Run the logistic mixed-effects model across all locations (districts)
results <- DHSr::Repglmre2(data = dummy_data, formula = formula,
location_var = location_var, random_effect_var = random_effect_var,
family = binomial())
# Print the results
print(head(results))
#> # A tibble: 6 × 16
#> location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
#> <int> <dbl> <dbl> <dbl>
#> 1 1 0.75 -0.228 2.17e- 2
#> 2 2 0.800 0.200 -7.77e-12
#> 3 3 0.462 0.449 -1.96e-17
#> 4 4 0.714 0.514 -1.43e- 1
#> 5 5 0.750 0.114 -4.55e- 2
#> 6 6 0.715 0.131 -1.35e- 1
#> # ℹ abbreviated name: ¹`estimate_gender_female:household_wealth`
#> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
#> # `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> # std_error_gender_female <dbl>,
#> # `std_error_gender_female:household_wealth` <dbl>,
#> # `std_error_Marginal R-squared` <dbl>,
#> # `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …
listw
: Essential for cluster map creation and Stein’s beta; optional
if weights exist.
spdeplisa
: Computes LISA values using native spdep functions. It’s
recommended to run this before cluster_map for consistency, though it
may not be needed if LISA signs and indicators are precomputed
cluster_map
: Produces clusters from spatial data, with cluster IDs
and admin boundary listings (e.g., districts or states). The min_
parameter sets the minimum number of admin boundaries per cluster, and
min_area sets the minimum cluster area. The lisa_cutoff adjusts the
threshold for clustering, useful for continuous clusters. The function
provides a basic map; for more dynamic mapping, users can use otyher
packages ( such as ggplot2::geom_sf) after obtaining cluster IDs. The
lisa_label and label specify the type of clusters (e.g., High-High).
##cluster_map parameter:
dataset
: Input spatial dataset ( “sf” “data.frame” object)
lisa_value
: Variable used for LISA analysis lisa_label
: Variable
indicating LISA significance label
: Specific label for clustering
(e.g., “High-High”) lisa_cutoff
: Threshold for LISA values
location_var
: Column name for location variable location_name
:
Column name for location name level2
: Optional second level of
geography id_start
: Starting ID for cluster numbering comparison
:
Comparison operator for cutoff min_area
: Minimum area for valid
clusters min_
: Minimum number of districts per cluster title
: Title
for the resulting plot subtitle
: Subtitle for the resulting plot
footnote
: Footnote for the resulting plot legend_position
: Position
of the legend on the plot color_scheme
: Color scheme for the plot
# Load the sf package if not already loaded
library(sf)
#> Warning: package 'sf' was built under R version 4.3.3
#> Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
# Group by district_code and calculate mean household_wealth
mean_wealth <- dummy_data %>%
group_by(district_code) %>%
summarize(mean_household_wealth = mean(household_wealth, na.rm = TRUE))
# Use the custom listw function with loc_shape and loc_data parameters
listw <- listw(
shapefile_path = shapefile_path, # Use the predefined shapefile path
data = mean_wealth,
loc_shape = "objectid", # The key column to be used for merging in the shapefile
loc_data = "district_code", # The column in data that will be used if loc_shape doesn't exist
weight_function = function(d) exp(-d / 0.2)
)
#> Reading layer `India%3A_State_Boundary_2021_' from data source
#> `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 36 features and 19 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
#> Geodetic CRS: WGS 84
#> Warning in nb2listw(res$neighbours, glist = res$weights, style = style, : zero
#> sum general weights
# Read the shapefile
shape_data<- st_read(shapefile_path)
#> Reading layer `India%3A_State_Boundary_2021_' from data source
#> `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 36 features and 19 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
#> Geodetic CRS: WGS 84
# Ensure uniqueness by REGCODE
shape_data <- shape_data %>% distinct(objectid, .keep_all = TRUE)
#objectid is the location-shape(loc_shape) in listw
# Merge shapefile data with the provided dataset
mean_wealth_geom<- merge(shape_data, mean_wealth, by.x = "objectid", by.y="district_code", all.y = TRUE)
lisa2 <- Spdeplisa(mean_wealth_geom, "mean_household_wealth",listw)
result <- cluster_map(lisa2, "lisa_I", "sign_combination3", "High-High", 0.4, "objectid", "name", "objectid", id_start = 0, comparison = ">", min_area = 0, min_ = 1,
title = "Wealth Clusters ", subtitle = "State Level", footnote = NULL, legend_position = "right", color_scheme = "D")
plot<- result$plot
clusters <-result$dataset_with_clusters
#merge the OLS beta's for each location
cluster_beta <- merge(clusters, results1, by.x = "objectid", by.y = "location")
# Example usage
results_with_stein_beta <- DHSr::stein_beta(
data = cluster_beta,
cluster_id = "cluster_id",
beta = "estimate_gender_female" # Replace with the actual column name for beta in your data
)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.