In DataScienceSalon/xms: Extra-Marital Sex: Attitudes and Behaviors

options(knitr.table.format = "html")
options(max.print="75", scipen=999, width = 800)
knitr::opts_chunk$set(echo=FALSE,
                 cache=FALSE,
               prompt=FALSE,
               tidy=TRUE,
               root.dir = "..",
               fig.height = 8,
               fig.width = 20,
               comment=NA,
               message=FALSE,
               warning=FALSE)
knitr::opts_knit$set(width=100, figr.prefix = T, figr.link = T)
knitr::knit_hooks$set(inline = function(x) {
  prettyNum(x, big.mark=",")
})

load(file = "../data/GSS.Rdata")

xms <- preprocess(GSS)

eda <- univariate(xms$univariate)

Introduction

There is no dearth of research on Americans' changing attitudes towards marriage, its primacy as a way of life, and its exclusivity. Approximately four-in-ten Americans say that the present institution of marriage is becoming obsolete [@Taylor2010]. According to the same Pew Research 2010 study, 72% of all adults in America were married in 1960. By 2008, this percentage had dropped to 52%. Alongside this apparent decline in marriage, Americans are slowly becoming more accepting of open relationships, consensual non-monogamy (CNM) practices, and the like. Nearly half of Americans would consider an open relationship, though fewer than 4% actually claim to be in one [@Avvo]. Notwithstanding, nearly 22% of married men and 14% of married women admit to having had an affair at least once during their marriages [@Johnson2017].

The purpose of this study is to examine the nature and evolution of attitudes and opinions with respect to (w.r.t.) extra-marital conduct. Serving as the data source are the opinions of over 65,000 Americans (44% women, 56% men) over a period of 44 years (1973-2016), courtesy of the General Social Survey (GSS) [@NORCa], a project of NORC at the University of Chicago that monitors societal change and studies the growing complexity of American society.

Research Questions

Opinions were characterized as "traditional" when the attitude is that extra-marital conduct is always wrong, or almost always wrong. Alternatively, opinions were labelled as "non-traditional" if the belief is that such conduct is sometimes wrong or not wrong at all. Throughout this study, the shorthand "opinion" refers to "non-traditional" opinion. That said, the nature of opinion is examined via the following 11 research questions.

r kfigr::figr(label = "rq", prefix = TRUE, link = TRUE, type="Table"): Research Questions

rq <- openxlsx::read.xlsx("../data/gssvars.xlsx", sheet = 3)
knitr::kable(rq) %>%  
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Document Organization

The methods section describes the data, the sampling techniques, data preprocessing, and introduces the methods, descriptive and inferential techniques used in the analyses. Hypotheses are offered, analyzed and tested in the results section. The discussion section describes the significance of the findings vis-a-vis currently available research. Lastly, the conclusion synthesizes the key points.

Methods

Data

The General Social Survey (GSS), a project of NORC at the University of Chicago, provided the data upon which this report is based. Formerly known as the National Opinion Research Center, "NORC at the University of Chicago is an objective non-partisan research institution that delivers reliable data and rigorous analysis to guide critical programmatic, business, and policy decisions" [@NORC]. Principally funded by the National Science Foundation, the GSS has been monitoring the evolution and complexity of the opinions, behaviors, and attributes of American society since 1973. Targeting the adult population, age 18 and over, in the United States, the data cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

Data Sampling Strategy

The GSS employed a multi-stage area full probability sampling strategy.

At the first stage, NORC employed the Standard Metropolitan Statistical Areas (SMSAs) or non-metropolitan counties selected in NORC's Master Sample as the Primary Sampling Unit (PSU). These were further subdivided into three categories. Category 1, representing 56% of the US population, comprised the nation's largest metropolitan areas or Consolidated Statistical Areas (CSAs) with a population of at least 1,543,728 (0.5 percent of the 2010 Census U.S. population). The second category, covering 35% of the US population, included Core Based Statistical Areas and CSAs with at least 8 tracts that are predominantly street-style addresses. The third category spanned 9% of the US population and consisted of small counties with at most 8 tracts. The minimum size for tracts was defined as 300 respondents.

Similarly, the second stage sampling was conducted in three categories. Category 1 comprised 216 type A tracts (city addresses) and 8 block-groups within type B (rural) tracts. For category 2, 8 segments per first-stage selection resulted in 480 segments in the 2010 National Sample Design, but GSS used only 120 of them. For category 3 first-stage selections, the 2010 National Sample Design selected only 5 segments per first-stage selection, but GSS used 4 and one half of them for a total of 56 segments.

The sampling strategy employed by GSS was designed to give each household an equal probability of being included in the sample. For each household selected, sampling procedures were undertaken to ensure that each individual in that household had an equal probability of being interviewed.

Data Sampling Bias

The GSS samples closely resemble distributions reported in the Census and other authoritative sources. However, survey non-response, sampling variation, and various other factors have introduced potential sources of bias and resulted in variance from Census distributions on some variables. For instance, all full-probability samples under-represent males, and block quota samples under-represent men in full-time employment. Weights were designed (and should be employed when using GSS 2004 or later) to assure proper representation of non-response sub-samples and other factors. For a full discussion of distributional variation due to non-response, one should refer to the GSS Methodological Reports in the help and resources section at https://gssdataexplorer.norc.org/.

This full probability design was acknowledged by the National Science Foundation to be superior to simple random sampling, and generalizable to the population of adult citizens in America. Consisting of over 60,000 individual responses, spanning over 40 years, the GSS data remains one of the nation's treasures, providing data for inference about American society to researchers, academics, students, politicians, and opinion makers. Yet, as respondents were randomly selected (not randomly assigned to groups), inferences were limited to association, and no causal inferences were drawn from this data. That said, the extent to which the results of a research question can be generalized to the population requires that the inference conditions are met, specifically the proportion success/failure condition. As such, the question of generalizability will need to be re-addressed independently for each research question.

Data Variables

The following table lists the variables extracted from the GSS data.

r kfigr::figr(label = "vars", prefix = TRUE, link = TRUE, type="Table"): Study Variables

vars <- openxlsx::read.xlsx("../data/gssvars.xlsx", sheet = 4)
knitr::kable(vars) %>%  
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Data Preprocessing

The GSS variables were extracted for years 1973 through 2016 and the following variables were created.

Opinion: Two factor variable:
Traditional: Always wrong and almost always wrong
Non-Traditional: Sometimes wrong and not wrong at all

Age Group: Four factor variable:
18-24
25-44
45-64
65+

Region: Five factor variable:
Northeast: Comprised of responses from the New England and Mid Atlantic regions
Midwest: Comprised of responses from the East Central and West Central regions
South: Consisting of responses from the South Atlantic and East and West South Central regions
Mountain
Pacific

Responses with "NAs", "Don't Know", and other non-response values for the GSS variables in scope, were filtered from the data set.
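The recoding described above can be sketched in base R. This is a minimal illustration on a made-up data frame, not the package's `preprocess()` function; the column names (`xmarsex`, `age`) and the response labels are assumptions chosen for the example.

```r
# Hypothetical raw extract; column names and response labels are assumptions.
raw <- data.frame(
  xmarsex = c("always wrong", "almost always wrong", "sometimes wrong",
              "not wrong at all", NA, "don't know"),
  age     = c(19, 30, 52, 70, 41, 66),
  stringsAsFactors = FALSE
)

# Collapse the four substantive responses into the two-level opinion factor.
traditional     <- c("always wrong", "almost always wrong")
non_traditional <- c("sometimes wrong", "not wrong at all")
raw$opinion <- factor(
  ifelse(raw$xmarsex %in% traditional, "Traditional",
         ifelse(raw$xmarsex %in% non_traditional, "Non-Traditional", NA)),
  levels = c("Traditional", "Non-Traditional")
)

# Bin age into four groups using left-closed intervals: [18,25), [25,45), ...
raw$age_group <- cut(raw$age, breaks = c(18, 25, 45, 65, Inf),
                     labels = c("18-24", "25-44", "45-64", "65+"),
                     right = FALSE)

# Filter out NA, "don't know" and any other non-response values.
clean <- raw[!is.na(raw$opinion), ]
```

Anything that is not one of the four substantive responses maps to `NA` and is dropped, which mirrors the filtering rule stated above.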

Data Analysis

This study involved univariate, bivariate and multivariate statistics.

Univariate analyses (exploratory data analysis) provided the descriptive and inferential statistics on a single-variable basis. Since all variables were categorical, frequencies and proportions were revealed using frequency distribution (contingency) tables and bar charts. Confidence intervals for the true population proportion for each value of each categorical variable were also computed.

The bivariate analyses examined, tested, and reported the existence, strength and direction of associations between opinion and a single explanatory variable. Hypothesis driven chi-square tests of independence were used to determine whether an association was extant. For this analysis, confidence intervals didn’t apply. To evaluate the strength and direction of an association, hypothesis tests for the difference in two proportions were conducted pairwise for each level of the explanatory variable, to determine the probability of encountering a difference at least as extreme as that observed, assuming a null hypothesis of zero difference. If the probability was less than $\alpha = 0.05$, the tolerable probability of a type I error, the null hypothesis was rejected in favor of the alternative hypothesis of a difference in proportions. In addition, confidence intervals were computed for the difference in the proportions. If the interval didn’t include zero difference, the null hypothesis of equal proportions was rejected. Note: the degree to which the chi-square and two-proportion approaches agree was addressed only when the explanatory variable had two categorical levels, such as gender. For multiple two-proportion tests, some comparisons yielded significant differences, others did not. Comparing multiple p-values against a single chi-square statistic would not be appropriate in this case.

Similarly a hypothesis driven chi-square trend test, also known as the Cochran-Armitage test for trend, was conducted to determine whether a trend was extant. Based upon a chi-square statistic, a p-value is computed and an inference is made as was described above. This has one degree of freedom because the linear scoring of the proportions by year means that when one expected value is given all the others are fixed. This statistical technique is explained in further detail below.

Data Analysis Methodology

The following hypothesis driven approach was used for the univariate and multivariate analyses.
1. Plot the data
2. State Hypotheses
3. Check Parametric Inference Conditions
4. Determine the necessary minimum sample size (Single Proportion Inference)
5. Select appropriate statistical method / test statistic
6. Compute the p-value, confidence intervals
7. Interpret and report results.

Plot The Data

For the univariate analyses, data tables and bar plots illuminate the frequencies and proportions of responses for each level of the categorical variable. Histograms graphically depict the distribution of quantitative variables in the sample. For the multivariate analysis, data tables and bar plots show the proportion of opinion at each level of the explanatory categorical variable. The effect of a quantitative variable, such as the number of hours watching television, on opinion is graphically depicted using box-plots.

State Hypotheses

Four types of hypotheses statements were used in this study.
1. Single proportion
2. Difference of two proportions
3. Association
4. Trend in proportions

Single Proportion
This type of hypothesis statement, used in the univariate analysis of categorical variables to estimate the true population parameter, took the following form:
$H_0$ $p - \hat{p} = 0$
$H_a$ $p - \hat{p} \neq 0$
where:
$p$ is the true population proportion
$\hat{p}$ is the observed sample proportion

Differences in two Proportions
This type of hypothesis statement was used in the multivariate analysis to determine the true difference in the proportion of non-traditional opinion for two populations and took the following form.
$H_0$ $p_1 - p_2 = 0$
$H_a$ $p_1 - p_2 \neq 0$ Two sided, or
$H_a$ $p_1 < p_2$ One sided, or
$H_a$ $p_1 > p_2$ One sided
where:
$p_1$ is the proportion of non-traditional opinion in the first population of interest
$p_2$ is the proportion of non-traditional opinion in the second population of interest

Association
This type of hypothesis statement was used to ascertain the association between opinion and categorical explanatory variables and took the following form.
$H_0$ $p_1 = p_{10}, p_2 = p_{20}, ...p_k = p_{k0}$
$H_a$ At least one proportion is different from the others.
where:
$k$ is the number of response categories
$p_k$ is the observed proportion for the $k^{th}$ response category
$p_{k0}$ is the expected proportion for the $k^{th}$ response category

Trend in Proportions
This type of hypothesis statement was used to determine whether a trend in opinion was extant over time, controlling for gender, and took the following form:
$H_0$ No trend exists in the proportion of opinion over time, controlling for gender.
$H_a$ A trend exists in the proportion of opinion over time, controlling for gender.

Inference Conditions

Statistical inferences were derived from parametric statistics and the central limit theorem, whereby the former assumes that sample data come from a population distribution with a fixed set of parameters, and the latter assumes that the sampling distribution for the parameter of interest is normal. The sampling distribution for inferences on proportions is $N(\mu, \sigma^2)$, where $\mu$ is the population proportion $p$, estimated by $\hat{p}$, and $\sigma^2$ is the variance of the sample proportion. The degree to which the distribution can be considered normal depends upon the type of inference being undertaken.

Inference for Single Proportion
These conditions applied to the univariate analyses of the categorical variables. One can assume that the sampling distribution of the proportion $\hat{p}$ is normal if:
1. Independence: The sample observations are independent. Since all samples were obtained through complex random sampling, and the sample consists of less than 10% of the adult U.S. population, independence was assumed for all statistical tests.
2. Success/Failure: There must be a minimum of ten successes and failures in the sample.

Inference for Difference of Two Proportions
These conditions applied to the multivariate analyses of the difference in the proportion of non-traditional opinion in two populations. One can assume that the sampling distribution of the difference of two proportions $\hat{p_1} - \hat{p_2}$ is normal if:
1. Within Sample Independence: Each sample meets the independence criteria, specifically, the samples were obtained from a random sampling process and each consists of less than 10% of the adult U.S. population.
2. Cross-Sample Independence: Both samples are independent from each other.

Inference for Association and Trend (Chi-square independence & Cochran-Armitage test)
These conditions applied to the multivariate analyses of the association between opinion and the various explanatory variables. The following three conditions were confirmed for all association tests.
1. Independence: Each case that contributes a count to the table must be independent of all the other cases in the table.
2. Sample Size: Each particular scenario (i.e. cell count) must have at least 5 expected cases.
3. Degrees of Freedom: The contingency table must be associated with a chi-square distribution with two or more degrees of freedom. This requirement is relaxed for trend tests.

Minimum Sample Size

Minimum required sample sizes were computed during the univariate analysis to confirm that the sample parameters were within a 5% margin of error of the true population parameters. The minimum sample size for estimating the population proportion was computed as follows:

$$N_{min} = \frac{p_l(1-p_l)}{(\frac{me}{z^*})^2}$$
where:
$p_l$ is the sample proportion for the categorical variable at level $l$
$me$ is the margin of error
$z^*$ is the critical value on the z-distribution for the designated confidence level
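As a hedged illustration of the minimum sample size formula, the sketch below evaluates it in base R. The function name `min_sample_size` and its arguments are hypothetical, and the critical value is taken from `qnorm` for a two-sided 95% confidence level.

```r
# Minimum sample size for estimating a proportion within a margin of error.
# `p` is the sample proportion, `me` the margin of error, and `conf` the
# confidence level (all argument names are illustrative assumptions).
min_sample_size <- function(p, me = 0.05, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value, ~1.96
  ceiling(p * (1 - p) / (me / z_star)^2)
}

n_worst_case <- min_sample_size(0.5)    # p = 0.5 maximizes p(1 - p); gives 385
```

Because $p(1-p)$ peaks at $p = 0.5$, that worst case gives an upper bound on the required sample size for any level of a categorical variable.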

Statistical Methods

Four types of statistical methods were used to evaluate the hypotheses statements:
1. Single proportion z-tests for estimating the true population parameter
2. Two-proportion z-tests for the difference in proportions with confidence intervals
3. Chi-square test for independence
4. Cochran-Armitage test for trends in proportions

Single Proportion z-test

Single proportion z-tests were used to construct confidence intervals for a population proportion, computed as follows: $$\hat{p} \pm z^* \times SE$$ where:
$\hat{p}$ is the point estimate for the population proportion, i.e. the sample proportion
$z^*$ is the $z_{\alpha/2}$ critical value for a two sided z-distribution at a 95% confidence level $\approx 1.96$

The standard error $SE$ is computed as follows:
$$SE= \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where:
$n$ is the sample size
$\hat{p}$ is the sample proportion of interest
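The interval and standard error above can be computed directly. The following is a minimal sketch with made-up counts; `prop_ci` is a hypothetical helper, not a function from the xms package.

```r
# Wald confidence interval for a single proportion: p_hat +/- z* x SE.
prop_ci <- function(successes, n, conf = 0.95) {
  p_hat  <- successes / n
  se     <- sqrt(p_hat * (1 - p_hat) / n)
  z_star <- qnorm(1 - (1 - conf) / 2)
  c(estimate = p_hat,
    lower    = p_hat - z_star * se,
    upper    = p_hat + z_star * se)
}

# Hypothetical example: 120 non-traditional opinions out of 1,000 responses.
ci <- prop_ci(successes = 120, n = 1000)
```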

Two proportion z-test

To evaluate hypotheses w.r.t. the difference in two proportions, z-tests were conducted using the pooled proportions for the standard error calculation. Pooled proportions, $\hat{p}$, were calculated as follows:

$$\hat{p} = \frac{\hat{p_1}n_1 + \hat{p_2}n_2}{n_1 + n_2}$$
where:
$n_1$ is the size of sample 1
$n_2$ is the size of sample 2
$\hat{p_1}$ is the proportion of interest from sample 1
$\hat{p_2}$ is the proportion of interest from sample 2

Given the pooled proportion, the pooled standard error, $SE_{pooled}$, for the difference in proportions was calculated as follows:
$$SE_{pooled} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}}$$
where:
$\hat{p}$ is the pooled proportion calculated above
$n_1$ is the size of sample 1
$n_2$ is the size of sample 2

Lastly, the z-statistic under the null hypothesis of equal proportions was calculated as follows:
$$Z = \frac{(\hat{p_1} - \hat{p_2}) - 0}{SE_{pooled}}$$
The z-statistic was used to calculate a p-value by comparing the value of the statistic to a standard normal distribution, $N(\mu = 0, \sigma^2 = 1)$. The selected level of confidence for all statistical tests was 95% ($\alpha = 0.05$). If the p-value was less than or equal to $\alpha$, the null hypothesis was rejected. Otherwise, the null hypothesis was not rejected. The p-value, obtained using the R statistical software package, is the tail area beyond the observed statistic under the Gaussian density:
$$p(x;\mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where:
$x$ = random variable
$\mu$ = mean of random variable $x$
$\sigma$ = standard deviation of random variable $x$
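The pooled two-proportion z-test can be sketched as follows. The counts are invented for illustration and `two_prop_z_test` is a hypothetical helper. The cross-check relies on the fact that `prop.test` without continuity correction reports a chi-square statistic equal to $Z^2$.

```r
# Two-sided z-test for a difference in proportions using the pooled SE.
two_prop_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool  <- (x1 + x2) / (n1 + n2)   # same as (p1*n1 + p2*n2) / (n1 + n2)
  se_pool <- sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
  z <- (p1 - p2) / se_pool
  # Two-sided p-value: area in both tails of the standard normal.
  list(z = z, p_value = 2 * pnorm(-abs(z)))
}

# Hypothetical counts of non-traditional opinion in two groups.
res <- two_prop_z_test(x1 = 150, n1 = 1000, x2 = 110, n2 = 1000)

# Cross-check against stats::prop.test (X-squared equals z^2).
chk <- prop.test(c(150, 110), c(1000, 1000), correct = FALSE)
```

For a one-sided alternative, the p-value would instead be `pnorm(z)` or `pnorm(z, lower.tail = FALSE)`, depending on the direction.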

Z-tests were conducted to compute the confidence interval for the difference in two proportions. The 95% confidence intervals for the differences in two proportions were computed as follows:
$$C = (\hat{p_1} - \hat{p_2}) \pm z^* \times SE$$ where:
$\hat{p_1}$ is the proportion of interest from sample 1
$\hat{p_2}$ is the proportion of interest from sample 2
$z^*$ is the $z_{\alpha/2}$ critical value for a two sided z-distribution at a 95% confidence level $\approx 1.96$

The standard error $SE$ is computed as follows:
$$SE= \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1} + \frac{\hat{p_2}(1-\hat{p_2})}{n_2}}$$
where:
$n_1$ is the size of sample 1
$n_2$ is the size of sample 2
$\hat{p_1}$ is the proportion of interest from sample 1
$\hat{p_2}$ is the proportion of interest from sample 2
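A minimal sketch of the confidence interval for the difference in two proportions follows. It uses the unpooled standard error given above, made-up counts, and a hypothetical helper name `diff_prop_ci`. The interval is compared with the one reported by `prop.test` when the continuity correction is switched off.

```r
# Confidence interval for p1 - p2 using the unpooled standard error.
diff_prop_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_star <- qnorm(1 - (1 - conf) / 2)
  c(lower = (p1 - p2) - z_star * se,
    upper = (p1 - p2) + z_star * se)
}

# Hypothetical counts of non-traditional opinion in two groups.
ci_diff <- diff_prop_ci(x1 = 150, n1 = 1000, x2 = 110, n2 = 1000)

# Cross-check against the interval from stats::prop.test.
chk_ci <- as.numeric(prop.test(c(150, 110), c(1000, 1000),
                               correct = FALSE)$conf.int)
```

If the resulting interval excludes zero, the null hypothesis of equal proportions would be rejected at the corresponding significance level.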

The relative risk ratio was calculated to compare the probability of encountering an opinion (traditional or non-traditional) at two levels of an explanatory variable. Relative risk was computed as follows:
$$RR = \frac{\frac{a}{a+b}}{\frac{c}{c+d}}$$
where:
$a$ is the number of respondents having opinion x, given condition 1
$b$ is the number of respondents having opinion y, given condition 1
$c$ is the number of respondents having opinion x, given condition 2
$d$ is the number of respondents having opinion y, given condition 2
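To make the relative risk formula concrete, the sketch below treats $a$, $b$, $c$ and $d$ as the cell counts of a 2x2 table, which is an assumption of this example. The counts and labels are invented for illustration.

```r
# Hypothetical 2x2 table: rows are conditions, columns are opinions.
tab <- matrix(c(30, 70,    # condition 1: a = opinion x, b = opinion y
                15, 85),   # condition 2: c = opinion x, d = opinion y
              nrow = 2, byrow = TRUE,
              dimnames = list(condition = c("1", "2"),
                              opinion   = c("x", "y")))

risk_1 <- tab["1", "x"] / sum(tab["1", ])   # a / (a + b) = 0.30
risk_2 <- tab["2", "x"] / sum(tab["2", ])   # c / (c + d) = 0.15
rr <- risk_1 / risk_2                       # relative risk = 2
```

Here opinion x is twice as likely under condition 1 as under condition 2.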

Multiple Two-Proportion Tests
Lastly, multiple difference in proportion tests were often conducted to compare the differences across several levels of a categorical explanatory variable. Since the chance of rare events increases with multiple tests, the likelihood of incorrectly rejecting the null hypothesis also increases. To compensate for that increase, the Bonferroni correction was applied to the hypothesis tests, changing the significance level to $\alpha / m$, where $\alpha$ is the level of significance and $m$ is the number of hypothesis tests. The number of hypothesis tests is $m = k(k-1)/2$, where $k$ is the number of groups being tested. The confidence interval was also adjusted, changing the overall confidence level from $1 - \alpha$ to $1 - \alpha/m$. The Marascuilo method of multiple comparisons was used to test all combinations of groups.
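The Bonferroni adjustment can be illustrated with `pairwise.prop.test` from the stats package. This is a sketch on invented counts and is not the Marascuilo procedure; it only shows how the number of tests $m$ and the adjusted threshold relate to the pairwise p-values.

```r
# Hypothetical counts of non-traditional opinion in k = 4 groups.
x <- c(120, 150, 90, 60)     # successes per group
n <- c(800, 900, 700, 600)   # sample size per group

k <- length(x)
m <- k * (k - 1) / 2         # number of pairwise tests: 6
alpha_adj <- 0.05 / m        # Bonferroni-adjusted significance level

# Pairwise two-proportion tests with Bonferroni-adjusted p-values.
# Comparing an adjusted p-value with 0.05 is equivalent to comparing
# the raw p-value with alpha_adj.
pw <- pairwise.prop.test(x, n, p.adjust.method = "bonferroni",
                         correct = FALSE)
significant <- pw$p.value <= 0.05   # lower-triangular matrix, NA elsewhere
```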

Chi-square test of independence.

To evaluate the association between opinion and the categorical explanatory variables, Pearson’s chi-square $X^2$ test of independence was employed to compute the probability of observing differences in proportions as extreme as the differences observed, assuming the null hypothesis that observed counts $O$, and expected counts $E$ are equal. The expected counts are computed as follows:

$E_{i,j} = \frac{OT_i \times OT_j}{OT}$ where:
$E_{i,j}$ is the expected count for row $i$ and column $j$ of the expected contingency table.
$OT_i$ is the total for row $i$ from the observed contingency table.
$OT_j$ is the total for column $j$ from the observed contingency table.
$OT$ is the total count for the observed contingency table.

Given the observed and expected counts, the $X^2$ test statistic is the normalized sum of squared deviations between observed and expected counts and is computed as follows: $$X^2 = \displaystyle\sum_{i=1}^{r} \displaystyle\sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$ where:

$X^2$ is Pearson's cumulative test statistic.
$O_{i,j}$ is the frequency or count from row $i$ and column $j$ of the observed contingency table
$E_{i,j}$ is the frequency or count from row $i$ and column $j$ of the expected contingency table
$r$ is the number of rows in the observed (and expected) contingency tables
$c$ is the number of columns in the observed (and expected) contingency tables

The chi-squared statistic was used to calculate a p-value by comparing the value of the statistic to a chi-squared distribution with $df$ degrees of freedom, where $df = (r-1) * (c-1)$. The selected level of confidence for all statistical tests was 95% ($\alpha = 0.05$). If the p-value was less than or equal to $\alpha$, the null hypothesis was rejected. Otherwise, the null hypothesis was not rejected.
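The expected counts, test statistic, degrees of freedom and p-value described above can be reproduced by hand and checked against `chisq.test`. The contingency table below is invented for illustration.

```r
# Hypothetical observed contingency table: 3 groups by 2 opinion levels.
obs <- matrix(c(200, 300,
                150, 350,
                100, 400),
              nrow = 3, byrow = TRUE,
              dimnames = list(group   = c("A", "B", "C"),
                              opinion = c("Non-Traditional", "Traditional")))

# Expected counts: (row total x column total) / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic, degrees of freedom, and upper-tail p-value.
x2      <- sum((obs - expected)^2 / expected)
df      <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Cross-check against stats::chisq.test.
chk <- chisq.test(obs, correct = FALSE)
```

The sample size condition can be checked on the same object with `all(expected >= 5)`.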

Cochran-Armitage Test

The Cochran-Armitage test for trends in proportions is a more powerful version of the chi-square test for independence that takes into account a linear scoring for an ordinal categorical variable. Assume there are $k$ independent binomial variates $y_i$, with sample sizes $n_i$, at ordinal explanatory categorical levels $x_i$, for $i = 1, 2, ..., k$ where $x_1 < x_2 < ... < x_k$, let:

$N = \displaystyle\sum_{i=1}^{k} n_i$
$\hat{p} = \frac{1}{N}\displaystyle\sum_{i=1}^{k} y_i$
$\hat{q} = 1 - \hat{p}$
$\bar{x} = \frac{1}{N}\displaystyle\sum_{i=1}^{k} n_ix_i$

The uncorrected test statistic is:
$$z = \frac{\displaystyle\sum_{i=1}^{k} y_i(x_i-\bar{x})}{\sqrt{\hat{p}\hat{q}\displaystyle\sum_{i=1}^{k} n_i(x_i-\bar{x})^2}}$$
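A hedged sketch of the uncorrected Cochran-Armitage statistic follows. It reads the square root as covering the whole denominator and takes $\bar{x}$ as the $n_i$-weighted mean of the scores $x_i$. The counts are invented, and the result is compared with `prop.trend.test`, whose chi-square statistic equals $z^2$ under the same linear scores.

```r
# Hypothetical data: counts of non-traditional opinion over four periods.
y <- c(40, 55, 70, 90)        # successes per period
n <- c(400, 410, 420, 430)    # sample size per period
x <- seq_along(y)             # linear scores 1, 2, 3, 4

N     <- sum(n)
p_hat <- sum(y) / N
q_hat <- 1 - p_hat
x_bar <- sum(n * x) / N       # weighted mean of the scores

# Uncorrected Cochran-Armitage z statistic and two-sided p-value.
z <- sum(y * (x - x_bar)) /
  sqrt(p_hat * q_hat * sum(n * (x - x_bar)^2))
p_value <- 2 * pnorm(-abs(z))

# Cross-check against stats::prop.trend.test (X-squared on 1 df).
chk <- prop.trend.test(y, n, score = x)
```

A positive `z` indicates that the proportion increases with the score, a negative `z` that it decreases.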

Compute p-Value and Confidence Intervals

As indicated above, both p-value and confidence interval approaches were used for two proportion hypothesis tests.

Interpret and report results

Both hypothesis testing (p-values) and confidence intervals were used to interpret the difference in proportions. For the hypothesis tests, the null hypotheses were evaluated as follows:
p-value $\leq \alpha$, the null hypothesis is rejected in favor of the alternative hypothesis.
p-value $> \alpha$, the null hypothesis is not rejected.

For the confidence interval approach, the null hypotheses were evaluated as follows:
Confidence interval does not include zero, the null hypothesis is rejected in favor of the alternative hypothesis.
Confidence interval includes zero, the null hypothesis is not rejected.

System and Environment

This analysis was implemented using the 64 bit version of the R Programming Language, version 3.4.1 [@TheRFoundation2015], within the RStudio Version 1.1.330 [@RStudioTeam2016] development environment on a Windows x64-based laptop powered by an Intel Core i7-3610QM CPU @ 2.30GHz, 2301 MHz processor with 4 Cores, 8 Logical Processors, and 16.0 GB of installed memory, running the Microsoft Windows 10 Home operating system, version 10.0.14393 Build 14393. Statistical analysis functionality (chisq.test, prop.trend.test) was provided by the stats package [@RCoreTeam2013a]. Report writing and generation packages included knitr [@Xie2013] and kfigr [@Koohafkan2015]. Data management functionality was provided by the dplyr [@Wickham2015a], reshape2 [@Rcpp2016], xtable [@Dahl2016] and data.table [@Dowle2016] packages. Graphics and data visualization were powered by the ggplot2 [@Wickham2016], gridExtra [@BaptisteAuguie2016], and stargazer [@Hlavac2015] packages.

Reproducibility

This report, along with the software and data, is available as part of the xms package, which can be forked at https://github.com/DataScienceSalon/xms. Alternatively, the package can be installed from GitHub using devtools as follows:

devtools::install_github("DataScienceSalon/xms")

Results

Exploratory Data Analysis

The following descriptive and inferential statistics were computed for each study variable.
N: The counts for each level of the categorical variable
Minimum N: The minimum sample size required to ensure that the sample parameters were within a 5% margin of error (the confidence intervals) of the population parameters
Proportion: For categorical variables, the proportional responses at each level of each categorical variable
Cumulative Proportions: The cumulative proportions at each level of each categorical variable
Confidence Interval: The 95% confidence interval for the population parameter
Furthermore, the conditions for inference were checked in order to characterize the degree to which the samples were representative of the population proportions.

Opinion

As shown in r kfigr::figr(label = "edaOpinionStats", prefix = TRUE, link = TRUE, type="Table"), there were a total of r sum(eda$opinion$stats$N) observations, with r eda$opinion$stats$N[1] and r eda$opinion$stats$N[2] traditional and non-traditional opinions, respectively. Of all respondents interviewed since 1973, r round(eda$opinion$stats$Cumulative[1] * 100, 0)% considered extra-marital conduct to be always wrong or almost always wrong.

r kfigr::figr(label = "edaOpinion", prefix = TRUE, link = TRUE, type="Table"): Descriptive and inferential statistics for the opinion variable

knitr::kable(eda$opinion$stats, align = c("l", rep("c", 7))) %>%  
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")