Diversity and Inclusion Report

Introduction

This report is for the manager who honestly tries to forge strong and diverse teams, and fosters an inclusive atmosphere. Nobody is free from bias and we are influenced by who we are as well as by by our environment. Our brain is evolved to do pattern recognition, and just as machine learning that will pick up patterns that might be true (or true in our distorted perception of the world) on average, but forego the right of everyone to be treated as an individual. Even with the best intentions, each one of us will have certain biases: both active and passive. Active bias is where one holds an explicit or implicit bias and hence will automatically value people more based on that bias. To get you started, we refer to two possible places where you can test for your own biases:

Secondly, there is the passive bias. Passive bias occurs where (independently of your bias), the typical behaviour of one group is different than the other. To get you started on this subject, we refer to the MBTI profiles or the -- more recent and more scientific -- the ``theory of the big five personality traits''. The theory identifies five factors:

In both theories men and women are typically[^1] different. For example in the Big Five one finds that women score higher on extroversion, neuroticism, and agreeableness. The combination of those two last dimension implies that men will be (on average) more confident and less likely to accept that there is no salary rise for them. Therefore men will be more likely to ask promotion and salary increase and will therefore also be more likely to obtain it.

[^1]: Typically means here "on average". For example in the MBTI profile, we find that roughly 60% of males are "Thinking", where 60% of females are "Sensing". More information about the MBTI profiles is for example on Wikipedia or myersbriggs.org

This report aims to point out if there are possible areas where bias would have influenced the salaries in your team. We have no claims on where the bias comes from but present the analyses on such way that it becomes actionable.

The Particular Example Behind this Demo Report

library(div)
library(gridExtra)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
teamSize       <- 200
pctFemales     <- 0.45
maleSalaryBias <- 1.05
d <- div_fake_team(seed = 1890122, 
                   N = teamSize, 
                   genders = c("F", "M"),
                   gender_prob = c(pctFemales, 1 - pctFemales),
                   gender_salaryBias = c(1, maleSalaryBias)
                   )

The data used for this report is randomly generated with the following characteristcs:

So, while this data has a bias built in for the Gender Paygap (on average males will earn r paste0(round(maleSalaryBias*100 - 100, 2), "%")) more than females; there is no bias built in for any of the other dimensions that can be studies (e.g. Citizenship)

Diversity

First we present how diverse the team actually is. Diversity is expressed as a number between $0$ and $1$. The diversity is $1$ if the population in the team is exactly the same as the reference population (e.g. the gender diversity equals 1 if the team has 50\% males,^[The country has 50.2\% males in the work population, but that does not seem to be the case in Krakow -- so we use an equal percentage for both genders.] because that would be the same number as the reference population of the country).

Gender diversity per grade

In Figure \ref{fig:gender-gauge} you can see the diversity index for the gender distribution per grade.

```r"} t <- table(d$gender, d$grade) t1 <- apply(t, 2, diversity, prior = c(50.2, 49.8)) t2 <- tibble(value = t1, label = paste("Grade", names(t1))) div_gauge_plot(t2, ncol = 3)

## Age diversity

The diversity in function of age is represented in Figure \ref{fig:ageism}. Here we do not show the distribution per grade, because one can expect a natural correlation between age, experience and grade.

```r"}
# ages in Poland
ageRef <- c(15.94, 27.11, 23.99, 20.68, 12.28) / 100
ad <- table(cut(d$age, breaks = c(20, 30, 40, 50, 60, 65)))
ad <- ad / sum(ad)  # not needed, diversity() does this, bu useful for the table
t <- tibble(value = diversity(ad, prior = ageRef), label = "Age")
div_gauge_plot(t)
t <- rbind(ad, ageRef)
rownames(t) <- c("team", "reference")
knitr::kable(round(t,2), 
             caption = "The percentages of different age groups in this team compared to the population of Poland. An equal distribution would yield a diversity of 1.\\label{tab:div:age}"
             )

To calculate the diversity, we assume that the reference population is distributed as in Table \ref{tab:div:age}. These are the numbers corresponding to Poland.

Diversity in nationalities

While ethnicity is equally valuable in diversity, nationality is measurable, available in most HR databases, and readily quantified. Few people can trace their heritage uniquely to a particular ethinic ascendance and DNA studies reveal that the concept of race is a quite blurred reality -- it is rather a clustering of people that actually display a continuum instead binary racial characteristics.^[A good introduction is here on the website of Harvard University]

#t <- table(d$citizenship, d$grade)
t <- d                     %>% 
     group_by(citizenship) %>% 
     summarise(n = n())    %>%
     arrange(desc(n))      %>%
     ggplot( aes(x = citizenship, y = n))    +
        geom_col(aes(fill = n), width = 0.7) +
        theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
              axis.title.x=element_blank())

t

t <- d                            %>% 
     group_by(citizenship, grade) %>% 
     summarise(n = n())           %>%
     ggplot( aes(x = grade, y = n))                +
        geom_col(aes(fill = n), width = 0.7)       +
        facet_wrap(~ citizenship, scales="free_y") +
        theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
                axis.title.x=element_blank())

t

\afterpage{\clearpage}

Inclusion

In previous section, we have shed some light on how diverse the team is. In this section we want to examine how inclusive the team is. We will do this via the salary and unconscious bias. Everyone has certain biases and we can investigate if certain genders or nationalities earn less in similar roles.

We can however not prove that this is the result of bias. We show only the most important control variable: experience (or rather its proxy "age"). There might be many other factors such as starting date in the firm or team, performance, productive behaviour etc. These variables are not part of this exercise, but they can be used by the manager to understand if an observed paygap is equitable or not.

Method

We realise that publishing a ``paygap'' as difference between the average salary of the male and female is not very helpful as this is most likely not measuring a paygap but rather a difference in occupations.

Therefore we only present results of groups that are comparable.

Legend

So, there will be more stars if the probability of a bias is higher: this can be due to a higher bias and/or due to a larger sample size.

To improve readability, the column headers are abbreviated. We summarise here the columns and their content:

The Pay-Ratios (``Pay-Gap'')

The "paygap" as popularly used is actually a ratio and not a gap. If the paygap for example is 0.8 in the the gender analysis and the reference group is females, then it can be understood as follows: "a woman earns 0.8 for every dollar that a man (not woman) earns."

The Gender PayGap

Table \ref{tab:pg:gender} shows the results for the Gender paygap.

pg_gender <- d                 %>%
             div_paygap       
tmp <- div_parse_paygap(pg_gender, label = "tab:pg:gender")
knitr::kable(tmp$d,
             caption = tmp$caption,
             format.args = tmp$format.args,
             label = tmp$label)

The Citizenship PayGap

Table \ref{tab:pg:ctzn} shows the results for the paygap for the main nationality compared to all other nationalities together.

Note that the data did not include a bias in function of citizenship. So any bias that appears below is due to random sampling that used to created the data. That gives you an idea about the balance between being sensitive enough to pick up a bias and not to brand random variations as biased.

pg_citizenship <- d                %>%
             div_paygap(colname = "citizenship" , selectedValue = "Polish")        %>%
             div_round_paygap

tmp <- div_parse_paygap(pg_citizenship, colname = "citizenship", label = 'tab:pg:ctzn')
knitr::kable(tmp$d,
             caption = tmp$caption,
             format.args = tmp$format.args)

Conclusions -- How to Use This Report

This report is a statistical tool to that merely points in a direction where a manager can start looking for potential bias. It cannot prove that bias exits, in all cases it is necessary to understand what is happening and why.

The reason why we cannot prove the existence of bias is that: even where we have rather large numbers and can attribute three stars, the individual performance might be so that the apparent bias is justified.

The best way to us this report is hence to use it a lead that needs investigation. The goal must not be to make sure all stars disappear, but rather that we understand and accept why certain stars are there.

Appendix 1: The Diversity Index

We express diversity as a number between zero and one. Our calculation is based on @debrouwer2020 and more in particular section 36.3.1 ``The Business Case: a Diversity Dashboard''. Details can be found in the book^[In essence, the diversity index $di$ is defined as $di = -\sum{p_i\log{p_i}}$, where $p_i$ are the proportions of the population in class $i4.], but in summary the method has the following key aspects:

```r"} x <- seq(length.out = 500, from = 0, to = 1) t <- cbind(x, 1-x) tt <- tibble(diversity = apply(t, 1, FUN = diversity), x = x) p1 <- ggplot(data=tt, aes(x=x, y=diversity, color = diversity)) + geom_line(lwd = 3) + ggtitle("Diversity index for binary variables and prior probabilities of 50/50") + xlab("Proportion of one of the two classes")

tt <- tibble(diversity = apply(t, 1, FUN = diversity, prior = c(.7,.3)), x = x) p2 <- ggplot(data=tt, aes(x=x, y=diversity, color = diversity)) + geom_line(lwd = 3) + ggtitle("Diversity index for binary variables and prior probabilities of 70/30") + xlab("Proportion of one of the two classes") grid.arrange(p1, p2, ncol = 1)

In Figure \ref{fig:diversityIllustrated} we show how these properties of the diversity index evolve in function of the proportions of different classes. This example is for binary classes, because that is easier to visualise. When more classes are present, then the visualisations are to be done in a space with more dimensions.

# Appendix 2: Further Analysis

What really drives your salaries? Below is a decision tree that (is over-fitted) shows the most determining variables to come to the salary (average numbers in the boxes at the bottom).

```r
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
df <- data.frame(
            isFemale     = ifelse(d$gender == "F", 1, 0),
            grade        = d$grade,
            salary       = range01(d$salary / d$fte),
            jobID        = factor(d$jobID),
            age          = range01(d$age),
            degree       = factor(d$degree),
#            tenure_firm  = range01(d$tenure_firm),
#            tenure_grade = range01(d$tenure_grade),
            tenure_job   = range01(d$tenure_job),
            team         = factor(paste0(d$team, "_", d$subteam)),
            performance  = factor(d$performance)
            )
library(rpart)
library(rpart.plot)
t0 <- rpart(salary ~ ., df, method = "anova",
            control = rpart.control(
              cp = 0.001
            )
)
prp(t0, type = 5, extra = 0, box.palette = "auto", yesno = 1)

Bibliography



Try the div package in your browser

Any scripts or data that you put into this service are public.

div documentation built on May 6, 2021, 9:06 a.m.