# Fitting a multilevel index of segregation in R: using the MLID package

## Introduction

This tutorial introduces the tools and functions available in the MLID package to fit a multilevel index of dissimilarity. This is a measure of ethnic or social segregation that captures both principal dimensions of segregation - unevenness and spatial clustering - and that looks for scale effects as well as the contributions of particular places to the index value.

To begin, install the package from CRAN by typing

install.packages("MLID")


Or, for the latest development version:

# Needs devtools. Use: install.packages("devtools")
require(devtools)
devtools::install_github("profrichharris/MLID")


require(MLID)


### About the Index of Dissimilarity

The index of dissimilarity (ID) is widely used in social and demographic research. It examines whether the places where one population group is most likely to be located are also the places where another group is most likely to be present. The logic of the index is that if, for example, 1 per cent of population group Y resides in a neighbourhood then, all things being equal, 1 per cent of population group X ought to reside there too. If another neighbourhood is a little bigger and contains 2 per cent of all the Y group then it should contain 2 per cent of all the X group as well. In this way, if the share of the Y group is equal to the share of the X group in each and every neighbourhood then the two populations are said to have an even geographical distribution, described as a situation of 'no segregation'. However, if wherever the Y population is found, X is not (and vice versa) then there is a situation of 'complete segregation'.

The ID measures unevenness: how unevenly the two groups are distributed across the study region relative to one another, regardless of how big or small each group is in the total population (all that matters is the share of each group in each neighbourhood). However, unevenness is only one of the two principal dimensions of segregation. The other is spatial clustering. Although the ID measures the scale of segregation in a numeric sense, giving an amount of segregation, it does not do so in a geographic sense. The classic example is to compare a checkerboard-style pattern of alternating black-white squares with other patterns that have increasing amounts of spatial clustering. In each of the examples below, the ID is the same, showing complete black-white segregation, yet the pattern of spatial clustering is not.[^1]

[^1]: The 'stray' cell in examples 2-4 is to allow the model to be fitted. With it, the model correctly identifies that some of the variation remains at the base level.

A multilevel index of dissimilarity (MLID) improves upon the standard ID by capturing both the unevenness and the clustering. To see this, run the examples in the MLID package. Note that although the ID value is always 1.000, the other measures, Pvariance and Holdback, change with the geographical scale of segregation. Those other measures are explained later. All that matters for now is that they are sensitive to the pattern of spatial clustering whereas the standard ID is not.

checkerboard()


x <- c(rep(c(1,0), times=8), rep(c(0,1), times=8))
x <- matrix(x, nrow=16, ncol=16)
y <- abs(1-x)

n <- length(x)
dd <- dim(x)
rows <- 1:dd[1]
cols <- 1:dd[2]
grd <- expand.grid(cols, rows)
r2 <- ceiling(grd/2)
ID2 <- paste("A",r2$Var1,"-",r2$Var2, sep="")
r4 <- ceiling(grd/4)
ID4 <- paste("B",r4$Var1,"-",r4$Var2, sep="")
r8 <- ceiling(grd/8)
ID8 <- paste("C",r8$Var1,"-",r8$Var2, sep="")
gridcodes <- data.frame(ID=1:n, TwoBy2 = ID2, FourBy4 = ID4, EightBy8 = ID8)

grd <- raster::raster(x)
print(sp::spplot(grd, colorkey = FALSE,
col.regions = colorRampPalette(c("white", "black"))))

x <- rep(c(1,1,0,0), times=8)
x <- c(x, rep(c(0,0,1,1), times=8))
x <- matrix(x, nrow=16, ncol=16)
x[min(which(x==0))] <- 1
y <- abs(1-x)

grd <- raster::raster(x)
print(sp::spplot(grd, colorkey=FALSE,
col.regions = colorRampPalette(c("white", "black")),
border = "grey"))

x <- rep(c(1,1,1,1,0,0,0,0), times=8)
x <- c(x, rep(c(0,0,0,0,1,1,1,1), times=8))
x <- matrix(x, nrow=16, ncol=16)
x[min(which(x==0))] <- 1
y <- abs(1-x)

grd <- raster::raster(x)
print(sp::spplot(grd, colorkey = FALSE,
col.regions = colorRampPalette(c("white", "black")),
border = "grey"))

x <- rep(c(rep(1,8),rep(0,8)), times=8)
x <- c(x, rep(c(rep(0,8),rep(1,8)), times=8))
x <- matrix(x, nrow=16, ncol=16)
x[min(which(x==0))] <- 1
y <- abs(1-x)

grd <- raster::raster(x)
print(sp::spplot(grd, colorkey = FALSE,
col.regions = colorRampPalette(c("white", "black")),
border = "grey"))


Figure 1. Each of these patterns generates the same ID value yet they represent different degrees of spatial clustering. The multilevel index distinguishes between them.
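The invariance of the standard ID across these patterns can be checked without the package. The following is a minimal base R sketch, not the package's own code: `id_simple()` and `make_pattern()` are hypothetical helper names, and the patterns here omit the 'stray' cell. It applies the standard ID formula (given in the next section) to 16 by 16 grids built from blocks of side 1, 2, 4 and 8.

```r
# Standard index of dissimilarity for two sets of counts.
# id_simple() is a hypothetical helper, not a function from MLID.
id_simple <- function(y, x) {
  0.5 * sum(abs(y / sum(y) - x / sum(x)))
}

# Build an n-by-n grid of 0/1 values in alternating blocks of side `block`.
make_pattern <- function(block, n = 16) {
  band <- ceiling(seq_len(n) / block)
  outer(band, band, function(i, j) (i + j) %% 2)
}

block_sizes <- c(1, 2, 4, 8)
ids <- sapply(block_sizes, function(b) {
  x <- make_pattern(b)  # cells occupied by group X
  y <- 1 - x            # group Y occupies the remaining cells
  id_simple(y, x)
})
names(ids) <- paste0("block", block_sizes)
print(ids)
```

Every pattern returns the same ID because the index only compares shares cell by cell and takes no account of where the cells are.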

## Calculating the ID and MLID

The index of dissimilarity is calculated as $$\text{ID}=k\times\sum_i{\big|\frac{n_{yi}}{n_{y+}}-\frac{n_{xi}}{n_{x+}}\big|}$$ where $n_{yi}$ is the count of population group Y in neighbourhood $i$, $n_{y+}$ is the total count of Y across all neighbourhoods in the study region ($n_{y+} = \sum_i{n_{yi}}$), and $n_{xi}$ and $n_{x+}$ are the corresponding values for population group X. Setting the scaling constant to be $k = 0.5$ means that the maximum range for the ID is from 0 to 1.
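As a worked illustration with made-up counts (not taken from the census data), suppose four neighbourhoods contain 10, 20, 30 and 40 members of group Y and 40, 30, 20 and 10 members of group X, so that $n_{y+} = n_{x+} = 100$. Then, with $k = 0.5$, $$\text{ID} = 0.5\times\big(|0.1-0.4| + |0.2-0.3| + |0.3-0.2| + |0.4-0.1|\big) = 0.5\times 0.8 = 0.4$$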

The index summarises the differences between a set of observed values, $y_i = n_{yi}/n_{y+}$, and what those values would be under an expectation of 'zero segregation', $x_i = n_{xi}/n_{x+}$, which is when the share of the Y population in every neighbourhood is equal to the share of the X population. Substituting $y_i$ and $x_i$ for $n_{yi}/n_{y+}$ and $n_{xi}/n_{x+}$ in the formula gives $$\text{ID}=0.5\sum_i{|y_i-x_i|}$$ Writing this within a regression framework, $$y_i=\beta_0 + \beta_1x_i+\epsilon_i$$ Setting $\beta_0 = 0$ and $\beta_1 = 1$, and rearranging gives $$\epsilon_i = y_i - x_i$$ from which the ID can be calculated as $$\text{ID} = 0.5\sum_i|\epsilon_i|$$ This shows that the ID is half the sum of the absolute values of the residuals from a regression model where the dependent variable is the share of the Y population per neighbourhood, the intercept is zero and there is an offset, which is the share of the X population.
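The regression view can be reproduced in base R. The sketch below uses made-up counts and `lm()` with an offset and no intercept, so nothing is estimated and the residuals are simply $y_i - x_i$. It is an illustration of the algebra above, not a description of how the package fits the index internally.

```r
# Made-up counts for five neighbourhoods (illustrative only).
n_y <- c(10, 20, 30, 25, 15)
n_x <- c(30, 25, 20, 15, 10)

y <- n_y / sum(n_y)  # share of group Y in each neighbourhood
x <- n_x / sum(n_x)  # share of group X in each neighbourhood

# No intercept and no estimated slope: x enters as an offset,
# which fixes beta_0 = 0 and beta_1 = 1.
fit <- lm(y ~ 0 + offset(x))
eps <- unname(residuals(fit))

id_from_model <- 0.5 * sum(abs(eps))
id_direct <- 0.5 * sum(abs(y - x))
print(c(model = id_from_model, direct = id_direct))
```

Both routes give the same value, which is the point of the derivation: the ID is a summary of the residuals from a model with nothing left to estimate.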

The multilevel model is obtained by estimating how much of each residual is due to each level of a geographic hierarchy. For example, for a four level model where neighbourhoods at level $i$ group into districts at level $j$, those into larger administrative authorities at level $k$, and then into regions at level $l$, the residuals can be estimated as $$\epsilon_i = \hat\lambda_i + \hat\mu_j + \hat\nu_k + \hat\xi_l$$ giving $$\text{ID}= 0.5\sum_i|\hat\lambda_i + \hat\mu_j + \hat\nu_k + \hat\xi_l|$$ The geographical scales of segregation are then explored by looking at the residuals at each level, as the following case study demonstrates.
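To make the additive structure concrete, the sketch below splits the residuals for a three level toy hierarchy (neighbourhoods within districts within regions). It is only a crude stand-in: it uses raw group means, whereas the package estimates the level effects from a multilevel model, so the numbers it would produce differ. The counts and the `district` and `region` codes are made up for illustration.

```r
# Made-up counts for eight neighbourhoods in four districts and two regions.
n_y <- c(30, 20, 10, 10, 5, 5, 10, 10)
n_x <- c(5, 5, 10, 10, 20, 20, 15, 15)
district <- c("d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4")
region <- c("r1", "r1", "r1", "r1", "r2", "r2", "r2", "r2")

# Residuals from the offset model: share of Y minus share of X.
eps <- n_y / sum(n_y) - n_x / sum(n_x)

# Peel off the levels from the top down using group means
# (an assumption of this sketch, not the package's estimator).
xi <- ave(eps, region)            # region-level part
mu <- ave(eps - xi, district)     # district-level part of what remains
lambda <- eps - xi - mu           # neighbourhood-level remainder

print(data.frame(region, district, eps, xi, mu, lambda))

# The parts add back to the residuals, so the ID is unchanged.
id_value <- 0.5 * sum(abs(lambda + mu + xi))
print(id_value)
```

In this toy example most of the residual sits at the region and district levels, and only the first district shows any difference between its neighbourhoods.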

## Case Study

### Fitting and exploring the standard ID

The data frame

require(MLID)

data(ethnicities)


contains counts of various ethnic groups living in census small areas in England and Wales in 2011. Those small areas are called Output Areas (OAs).

head(ethnicities, n = 3)


To calculate the index of dissimilarity for the residential segregation of the Bangladeshi from the White British, we may use

index <- id(ethnicities, vars = c("Bangladeshi", "WhiteBrit"))
index


which generates the ID value stored in `index[1]`. The interpretation is that this proportion (`index[1] * 100` per cent) of either the Bangladeshi or the White British population would need to move for the two to be evenly distributed relative to one another. That is a large share, and it reflects the concentration of the Bangladeshi population in particular parts of the country such as London, and especially the Boroughs of Tower Hamlets and Newham within the capital, which the following 'impact' calculations reveal.

impx <- impacts(ethnicities, c("Bangladeshi", "WhiteBrit"), c("LAD","RGN"))
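The general idea of a place's contribution to the index can be sketched in base R. Because the ID is a sum over neighbourhoods, summing $0.5|\epsilon_i|$ within each higher-level unit shows how much of the total comes from that unit. The example below uses made-up counts and a hypothetical `lad` code; it is a simple illustration and should not be read as the calculation performed by `impacts()`.

```r
# Made-up counts for eight neighbourhoods in four local authorities.
n_y <- c(30, 20, 10, 10, 5, 5, 10, 10)
n_x <- c(5, 5, 10, 10, 20, 20, 15, 15)
lad <- c("A", "A", "B", "B", "C", "C", "D", "D")

# Neighbourhood residuals and each authority's part of the ID.
eps <- n_y / sum(n_y) - n_x / sum(n_x)
contribution <- tapply(0.5 * abs(eps), lad, sum)

# Express each authority's part as a percentage of the overall ID.
share <- 100 * contribution / sum(contribution)
print(round(share, 1))
```

Here authority A accounts for half of the index value while B contributes nothing, the toy equivalent of a few places dominating the national figure.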



### Refitting the multilevel index

Having identified the 'outliers', a next step is to refit the multilevel index with Tower Hamlets, Newham and E02001113 omitted.

newindex <- id(aggdata, vars = c("Bangladeshi", "WhiteBrit"), levels = c("MSOA","LAD","RGN"), omit = c("Tower Hamlets", "Newham", "E02001113"))
newindex


The ID increases slightly from `index[1]` to `newindex[1]`, but the more interesting change is in the measure of spatial clustering, Pvariance. This has changed from

attr(index, "variance")


to

attr(newindex, "variance")


which is a change of

attr(newindex, "variance") - attr(index, "variance")


What it reveals is a 'step down' from the LAD to the MSOA and LSOA (Base) scales.

Overall, the following observations may be drawn:

• The residential segregation of the Bangladeshi from the White British is high across England and Wales (although actually it decreased from the 2001 to the 2011 Census)
• The scale of segregation is highest at the local authority (LAD) scale
• That is because of the effects of Tower Hamlets and Newham
• Omitting Tower Hamlets and Newham (and also MSOA E02001113) leaves the dominant scales of segregation as the MSOA and LSOA levels

Within the segregation literature there has been a movement away from measuring ethnic segregation at a single scale using traditional indices, towards treating segregation as a multiscale phenomenon on which measurement at a range of scales will shed light. That literature has been the inspiration for this work. Amongst the contributions, several authors have promoted multilevel modelling as a way of looking at segregation at multiple scales of a geographic hierarchy simultaneously. The MLID package takes the approach forward by outlining a multilevel index of dissimilarity that combines the advantages of a widely understood index with a means to identify scale effects, in a way that is computationally fast to estimate and easily fitted in R.

### Acknowledgements

My thanks to Dewi Owen for thoughtful observations and comments, and for good company.

The package development was funded partly under the ESRC’s Urban Big Data Centre, grant ES/L011921/1.

Census data: Office for National Statistics; National Records of Scotland; Northern Ireland Statistics and Research Agency (2016): 2011 Census aggregate data. UK Data Service (Edition: June 2016). DOI: (http://dx.doi.org/10.5257/census/aggregate-2011-1). The information is licensed under the terms of the Open Government Licence (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3).

The LSOA, MSOA, LAD and RGN codes are from (http://bit.ly/2lGMdkE) and are supplied under the Open Government Licence: Contains National Statistics data. Crown copyright and database right 2017.

### References

Harris R 2017 Measuring the scales of segregation: Looking at the residential separation of White British and other school children in England using a multilevel index of dissimilarity, Transactions of the Institute of British Geographers in press

Jones K Johnston R Manley D Owen D and Charlton C 2015 Ethnic Residential Segregation: A Multilevel Multigroup Multiscale Approach Exemplified by London in 2011 Demography 52 1995-2019

Leckie G and Goldstein H 2015 A multilevel modelling approach to measuring changing patterns of ethnic composition and segregation among London secondary schools 2001–2010 Journal of the Royal Statistical Society Series A 178 405-424

Leckie G Pillinger R Jones K and Goldstein H 2012 Multilevel modelling of Social Segregation Journal of Educational and Behavioral Statistics 37 3-30

Manley D Johnston R Jones K and Owen D 2015 Macro- Meso- and Microscale Segregation: Modeling Changing Ethnic Residential Patterns in Auckland New Zealand 2001-2013 Annals of the Association of American Geographers 105 951-967

Owen D 2015 Measuring residential segregation in England and Wales: a model-based approach Unpublished PhD thesis School of Geographical Sciences, University of Bristol


MLID documentation built on May 2, 2019, 11:05 a.m.