Obesity: A two-level incomplete dataset based on an online obesity...

Obesity    R Documentation

A two-level incomplete dataset based on an online obesity survey

Description

This synthetic dataset is based on an online obesity survey that collected information on the dietary behavior of 2111 participants. We assumed that the data were gathered from five distinct locations (clusters). To account for potential selection bias in the weight responses, we simulated both the values and the observability of this variable using a Heckman selection model within a hierarchical structure.

Additionally, we assumed that in one of the locations, the weight variable was systematically missing. We also introduced missing values for some other variables in the dataset using a Missing at Random (MAR) mechanism.
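The mechanism described above can be illustrated with a minimal base R sketch. This is not the authors' generation code (that is linked under Details); every variable name, coefficient, and variance below is an illustrative assumption. The sketch draws correlated errors for an outcome equation and a selection equation, adds cluster-level random intercepts to both, and then sets the outcome to missing for every unit in one cluster.

```r
set.seed(123)

n_clusters <- 5
n <- 2111
cluster <- sample(seq_len(n_clusters), n, replace = TRUE)

# Cluster-level random intercepts for the outcome and selection equations
u_out <- rnorm(n_clusters, sd = 2)
u_sel <- rnorm(n_clusters, sd = 0.3)

# x enters both equations; z enters the selection equation only
# (an exclusion restriction, assumed here for illustration)
x <- rnorm(n)
z <- rnorm(n)

# Correlated errors: rho != 0 ties observability to the unobserved
# part of the outcome, so the missingness is not at random
rho <- -0.6
e_sel <- rnorm(n)
e_out <- rho * e_sel + sqrt(1 - rho^2) * rnorm(n)

# Outcome equation (weight) and latent selection propensity
weight_full <- 80 + 5 * x + u_out[cluster] + 10 * e_out
sel_latent <- 0.5 + 0.4 * x + 0.8 * z + u_sel[cluster] + e_sel

# Weight is observed only when the latent propensity is positive
observed <- sel_latent > 0
# One cluster with the variable systematically missing
observed[cluster == 5] <- FALSE

weight <- ifelse(observed, weight_full, NA_real_)
sim <- data.frame(cluster, x, z, weight)

# Proportion of missing weights per cluster
tapply(is.na(sim$weight), sim$cluster, mean)
```

With a negative `rho`, units with a large positive outcome error are less likely to be observed, so a complete-case analysis of `weight` would tend to be biased.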

Format

A data frame with 2111 observations on the following variables:

Gender a factor variable with two levels: 1 ("Female"), 0 ("Male").
Age a numeric variable indicating the subject's age in years.
Height a numeric variable indicating the subject's height in meters.
FamOb a factor variable describing the subject's family history of obesity with two levels: 1 ("Yes"), 0 ("No").
Weight a numeric variable indicating the subject's weight in kilograms.
Time a numeric variable indicating the time taken by the subject to respond to the survey questions in minutes.
BMI a numeric variable with the subject's body mass index.
Cluster a numeric variable indexing the cluster.
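
The relation between Weight, Height, and BMI can be sketched as follows, assuming BMI follows the standard definition (weight in kilograms divided by squared height in meters). The documentation above does not state how BMI was derived, so this is an assumption, and `bmi()` is a hypothetical helper rather than a function from the package.

```r
# Hypothetical helper: standard BMI definition (kg / m^2)
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

bmi(70, 1.75)

# A missing weight propagates to a missing BMI
bmi(NA_real_, 1.75)
```

Under this assumption, BMI cannot be computed for a subject whose Weight is missing, which matters when both variables are imputed.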

Details

The data generation code is available at https://github.com/johamunoz/Statsmed_Heckman/blob/main/4.Codes/gendata_Obesity.R

Source

Synthetic data based on the data retrieved from "https://www.kaggle.com/datasets/fabinmndez/obesitydata/"

References

Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344.

Examples

library(mice)
library(ggplot2)
library(data.table)

data(Obesity)
summary(Obesity)
md.pattern(Obesity)

# Number and proportion of missing Weight values per cluster
# (setDT() converts Obesity to a data.table by reference)
dataNA <- setDT(Obesity)[, .(nNA = sum(is.na(Weight)), n = .N), by = Cluster]
dataNA[, propNA := nNA / n]
dataNA

# Histogram of the observed Weight values, one panel per cluster
Obesity$Cluster <- as.factor(Obesity$Cluster)
ggplot(Obesity, aes(x = Weight, group = Cluster)) +
  geom_histogram(aes(color = Cluster, fill = Cluster),
                 position = "identity", bins = 30) +
  facet_grid(Cluster ~ .)
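
The per-cluster missingness summary can also be sketched in base R, without data.table, and extended to label each cluster as complete, sporadically missing, or systematically missing. The helper `missing_by_cluster()` and the toy data below are hypothetical and serve only as an illustration; the same call should work on the Obesity data with `var = "Weight"` and `cluster = "Cluster"`.

```r
# Hypothetical helper: proportion of missing values per cluster and a
# label for the resulting missingness pattern
missing_by_cluster <- function(data, var, cluster) {
  prop_na <- tapply(is.na(data[[var]]), data[[cluster]], mean)
  p <- as.numeric(prop_na)
  data.frame(
    cluster = names(prop_na),
    propNA  = p,
    pattern = ifelse(p == 1, "systematic",
                     ifelse(p > 0, "sporadic", "complete")),
    row.names = NULL
  )
}

# Toy data: cluster 2 has no observed Weight at all
toy <- data.frame(
  Cluster = rep(1:3, each = 4),
  Weight  = c(70, NA, 82, 65,
              NA, NA, NA, NA,
              90, 77, 68, 71)
)

res <- missing_by_cluster(toy, "Weight", "Cluster")
res
```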

micemd documentation built on Nov. 17, 2023, 5:07 p.m.