Introduction

Chapter 1 studies the language of geographical description in English from 1473 to 1700. It introduces the basic premise of historical geospatial semantics, an interpretive paradigm that studies the history of geographical thinking by tracing changes in the way places are described.

By inserting the question of place into a study of meaning, geosemantics invites a style of analysis that constantly toggles back and forth between two perspectives: the place perspective and the word perspective. From the place perspective, we might ask: "What are all the ways Paris has been described, and how is Paris different from London?" From the word perspective, we might ask: "Where are 'natives' most often referred to, and how are 'natives' in America different from 'natives' in Europe?"

The trick of geosemantics is to shuffle a corpus around so that these kinds of patterns can be identified.

The Data

Mathematically, categories like lists of places, words, and documents are represented as vector spaces. A vector is a fixed-length list of elements, like a row or a column of a spreadsheet. What makes a group of vectors a space is that the vectors can be added together, scaled, or otherwise transformed. The term vector space names what we might think of colloquially as the range of possibility over which any sequence of numbers might be scaled or transformed while its general shape remains preserved.
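For instance, in R (a toy illustration, not data from the package), the defining operations are elementwise addition and scaling:

v = c(1, 0, 2)
w = c(0, 3, 1)
v + w   # elementwise addition: 1 3 3
2 * v   # scaling: 2 0 4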

W is a vector space of 22,542 dimensions. Each vector w in W represents a word found in EEBO. W is composed of two parts: a list of place names, P, and a list of keywords, K.

P is a vector space of 15,035 dimensions. Each element p in P is a unique toponym found in EEBO.

K is a vector space of 7,507 dimensions. Each element k in K is a high-frequency word from EEBO. (We built this list by sampling from across the period to include various spellings of major words.)

D is a vector space of 59,226 dimensions. Each element d in D is a document transcribed and marked up by the Text Creation Partnership.

A matrix is a two-dimensional array of numbers that describes how two vector spaces relate to each other. It represents a linear map that transforms one vector space into another. The linear map we'll work with relates the space of words W to the space of documents D.
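As a toy sketch (invented counts, not EEBO data), a word-document matrix has one row per word and one column per document; multiplying it by a vector of document weights carries that vector into the word space:

m = matrix(c(5, 0, 2,
             1, 3, 0),
           nrow = 3,
           dimnames = list(c("paris", "london", "natives"),
                           c("doc1", "doc2")))
m %*% c(0.5, 0.5)   # averages the two documents, landing in word space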

The analysis in this chapter hinges on manipulating that matrix in various ways. For practical purposes of managing computer memory, we'll work with two kinds of subspaces, limited to P and K: a place-document matrix, whose rows are drawn from P, and a keyword-document matrix, whose rows are drawn from K.

In the geosemantics R package, even these subsetted matrices would be too big for R's memory to handle individually, so each is broken up by date into five sections.
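Each section can be loaded on its own; following the pattern used later in this chapter, the first place-document section loads like this:

# load the first of the five place-document sections; the object is named pd
load(system.file("extdata", "ch1_pd1.rda", package = "geosemantics"))
dim(pd)   # toponyms by documents, for one of the five date ranges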

Lastly, the place-document frequencies are gathered in a single place-year table, which records, for each toponym, the sum of its frequencies across all documents dated to each year.
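That table ships with the package and, as in the analysis below, is loaded as the matrix py:

library(geosemantics)
data("ch1_place_year")   # loads the place-year matrix as py
dim(py)                  # one row per toponym, one column per year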

The Metadata

The TEI files prepared by the Text Creation Partnership include detailed metadata in their headers, including bibliographic information about the original sources. In the analyses below, we will often organize documents by date, but otherwise we will largely ignore information like author and title.

More important are two sets of metadata about EEBO's places. One is a table of longitude and latitude coordinates, taken from several geographical dictionaries published in the seventeenth century. Another is a gazetteer of places, also taken from the primary sources, which identifies each place within a global topology.

The gazetteer is structured in a subject-predicate-object tuple format. Entries in the gazetteer look like this:

SUBJECT  | PREDICATE   | OBJECT
---------|-------------|-------
paris    | instance of | city
paris    | is in       | france
france   | instance of | country
france   | is in       | europe
luetetia | is same as  | paris
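In R, these tuples are just rows of a three-column table. The sample entries above could be reconstructed like this (a sketch for illustration, not the dataset shipped with the package):

gaz_example = data.frame(
  SUBJECT   = c("paris", "paris", "france", "france", "luetetia"),
  PREDICATE = c("instance of", "is in", "instance of", "is in", "is same as"),
  OBJECT    = c("city", "france", "country", "europe", "paris")
)
gaz_example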

Where does this information come from? It comes from seventeenth-century geographical dictionaries. Here are the first few sentences from the relevant entry for Paris, in Edmund Bohun's A geographical dictionary representing the present and ancient names of all the counties, provinces, remarkable cities, universities, ports, towns, mountains, seas, streights, fountains, and rivers of the whole world (1693):

Paris, Leutetia, Luotetia, Lucetia, Leucotetia, Parisii, and Lutetia Parisiorum, the Capital City of the Kingdom of France; boasted by Baudrand, to be the greatest City of Europe; with a Nemine reclamante, no body denying it to be so. This was a celebrated City in the Times of the Roman Empire. Julian the Apostate (whilst he was Caesar only) resided here in the Reign of Constantius: and adorned it with Baths and a Palace. But its greatest Rise was from the Franks: Clodoveus settling the Royal Throne in this City, about the year 458.

Place descriptions like these are highly formulaic, and the geographical data curated for this chapter were put together by semi-automatically scrolling through such books and converting relevant information to the simple tuple format expressed above. Only three kinds of relations were gathered: categorical definitions ("instance of"), containment relations ("is in"), and equivalence relations ("is same as").

The goal is to be able to gather all descriptions of a place, even if the place had different names at different times. (Spelling variations were common.) Also, I want the option to search descriptions of larger geographical units (not just uses of the word Germany, but also descriptions of Germany's provinces, cities, towns, and rivers) and to compile them into a single statistical profile.

Begin by loading the gazetteer data and selecting rows with information about Paris:

library(geosemantics)
library(ggplot2)

# loads the gazetteer as a data frame named gaz
data("ch1_gazetteer")

# rows where paris appears as either subject or object
ord = which(gaz$SUBJECT == "paris" | gaz$OBJECT == "paris")
gaz[ord,]

You'll notice some idiosyncrasies in the data. Much information is repeated across different sources. The designation "is same as" is, on the face of it, rather vague and shouldn't be interpreted too literally. It gathers together two different correspondence relations that might benefit from better specification: "formerly known as" and "alternate spelling of." The analysis in this chapter collapses those relationships into a single composite sense of sameness, such that all values corresponding to Paris are either treated individually or added together and measured as a common unit.
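When the distinction matters, the underlying relations can still be inspected directly by filtering on the predicate column (a quick sketch using the gaz columns shown above):

# all sameness relations involving paris, of whatever kind
gaz[gaz$PREDICATE == "is same as" &
    (gaz$SUBJECT == "paris" | gaz$OBJECT == "paris"), ]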

The function that performs this collapse is called place_join. It reads over the gazetteer and returns all place names related to the one queried.

place_join("france", gaz = gaz, mode = "same")

To include places that are inside a larger place, use mode = "all":

place_join("france", gaz = gaz, mode = "all")[1:20]

Selecting mode = "all" thus returns not only "france" and its various spellings, but also places in France, like "anjou", "artois", or "aquitain."

Once related places are identified, their entries in the datasets can be compiled. Using the place-year matrix, for example, we can see how all references to French places are distributed over time throughout EEBO.

data("ch1_place_year")
french_places = place_join("france", gaz = gaz, mode = "all")
place_composition(mat = py, places = french_places)[28:47]

This little pass through the data shows spikes of interest in French places in the years 1502, 1506, and 1515. We could dig into those years to find which documents are most responsible for those spikes, but instead let's look over EEBO as a whole.

First let's trim the 1400s data from the set, because it's so sparse it tends to distort the visualizations.

# keep columns 27 through 226, i.e., the years 1500 through 1699
py = py[,27:226]
py = as.matrix(py)

Now let's look at references to France (including all French places), from 1500 through 1699.

plot_timeseries(mat = py, places = "france", compose_places = F)

For the first hundred years of the corpus, "france" represents only a small proportion of references, but it bursts onto the scene around the year 1600. We might be tempted to find a historical explanation for this trend -- perhaps during Queen Elizabeth I's reign France gained some important new prominence in English discourse -- but it turns out just to be an artifact of spelling changes.

plot_timeseries(mat = py, places = "fraunce", compose_places = F)

To get around such problems, the parameter compose_places gathers together all variations of a placename, as well as any smaller places it might contain. So the next graph counts not just references to the word type "france" but also "fraunce," along with places that France contains, like "paris," "brittany," and "lourraine."

plot_timeseries(mat = py, places = "france", compose_places = T)

It turns out that references to French and British places mirror each other quite closely.

plot_timeseries(mat = py, places = c("france","britain"), compose_places = T)

It's also possible to look at continents. Where was attention focused?

# every toponym the gazetteer classifies as a continent
continents = gaz$SUBJECT[gaz$OBJECT == "continent"]
continents = unique(continents)
plot_timeseries(mat = py, places = continents, compose_places = T)

Let's look just at America:

plot_timeseries(mat = py, places = "america", compose_places = T)
How did attention to British places compare with attention to the rest of Europe? The next block computes each group's share of all place references per year and plots the two series with ggplot2.

# gather all European and British places, removing Britain from the
# European list so the two series don't overlap
european_places = place_join("europe", gaz = gaz)
british_places = place_join("britain", gaz = gaz)
european_places = european_places[!european_places %in% british_places]

# restrict to places present in the matrix, then compute each group's
# share of all place references per year
british_places = british_places[british_places %in% rownames(py)]
british_freqs = colSums(py[british_places,]) / colSums(py)

european_places = european_places[european_places %in% rownames(py)]
european_freqs = colSums(py[european_places,]) / colSums(py)

# assemble a long-format data frame for ggplot
PLACE = c(rep("britain", 200), rep("europe (ex. britain)", 200))
YEAR = c(1500:1699, 1500:1699)
COUNT = c(british_freqs, european_freqs)
df = data.frame(PLACE, YEAR, COUNT)
ggplot(df, aes(x = YEAR, y = COUNT, color = PLACE)) + 
    geom_point() + 
    stat_smooth(method = "loess", aes(group = PLACE), se = F, fullrange = F) + 
    theme(panel.background = element_blank(),
          axis.title = element_blank()) +
    scale_x_continuous(breaks = c(1500, 1600, 1699),
                       labels = c("1500", "1600", "1700"))
plot_timeseries(mat = py, places = c("egypt"), compose_places = T)
plot_timeseries(mat = py, places = c("rome"), compose_places = T)
plot_timeseries(mat = py, places = c("israel"), compose_places = T)
plot_timeseries(mat = py, places = c("china","india"), compose_places = T)
plot_timeseries(mat = py, places = c("peru","mexico"), compose_places = T)
plot_timeseries(mat = py, places = c("boston","jamaica"), compose_places = T)
plot_timeseries(mat = py, places = c("edinburgh","dublin"), compose_places = T)
plot_timeseries(mat = py, places = c("oxford","cambridge"), compose_places = T)
plot_timeseries(mat = py, places = c("england","rome","israel"), compose_places = T, mode = "same")

As a next step, compose a table of the top five toponyms for each period, 1500-1699:

PLACE = c()
COUNT = c()
PERIOD = c()
for (i in 1:5) {
  print(i)
  # load the i-th place-document section; the object is named pd
  fname = paste("ch1_pd", i, ".rda", sep = "")
  load(system.file("extdata", fname, package = "geosemantics"))
  # drop stop-toponyms (stops, a character vector of place names to
  # exclude, must be defined before this loop is run)
  pd = pd[setdiff(rownames(pd), stops),]
  # find the 50 most frequent toponyms in this section
  totals = rowSums(pd)
  totals = sort(totals, decreasing = T)[1:50]
  toponyms = names(totals)
  # recompute each candidate's counts with spelling variants folded in
  composed_pd = pd[toponyms,]
  for (j in 1:length(toponyms)) {
    place = toponyms[j]
    places = place_join(place, gaz = gaz, mode = "same")
    vec = place_composition(pd, places)
    composed_pd[place,] = vec
  }
  # keep the five most frequent composed toponyms for this period
  ct = rowSums(composed_pd)
  ct = sort(ct, decreasing = T)[1:5]
  place = names(ct)
  period = rep(i, 5)
  PLACE = c(PLACE, place)
  COUNT = c(COUNT, ct)
  PERIOD = c(PERIOD, period)
}
df = data.frame(PLACE, COUNT, PERIOD)

In the essay, I briefly mention doing a regression analysis comparing the sixteenth and seventeenth centuries. Here's the process:

# for each toponym, one trend slope before 1600 and one after
slopes = matrix(0, nrow(py), 2)
colnames(slopes) = c("pre","post")
rownames(slopes) = rownames(py)

# fold each toponym's spelling variants into a single composed row
composed_py = py
for (i in 1:nrow(py)) {
  place = rownames(py)[i]
  places = place_join(place, gaz = gaz, mode = "same")
  composed_py[i,] = place_composition(py, places)
}

# normalize each year's column so that rows measure shares of that
# year's place references rather than raw counts
normed_py = composed_py
for (j in 1:ncol(normed_py)) {
  normed_py[,j] = normed_py[,j] / sum(normed_py[,j])
}

# fit one linear trend over the first hundred years (1500-1599)
# and another over the second (1600-1699)
for (i in 1:nrow(composed_py)) {
  print(i)
  vec = normed_py[i,]
  x = seq(from = 0, to = 1, length.out = 100)
  fit = lm(vec[1:100] ~ x)
  slope1 = fit$coefficients[2]
  fit = lm(vec[101:200] ~ x)
  slope2 = fit$coefficients[2]
  slopes[i,] = c(slope1, slope2)
}

Note: the toponym table above turns out to be of limited use. The top three toponyms, across all five periods, are Rome, England, and Israel; that's all I need to say.

Semantic mapping

Start by drawing a few simple semantic maps showing the distribution of words, then move on to semantic drift.


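One common preparatory step for semantic maps is tf-idf weighting, which scales down words that appear in nearly every document. The helper below applies that weighting. As written, it assumes documents on the rows and words on the columns, so the place-document matrices above would need to be transposed first: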
tf_idf = function(mat) {
  # idf: for each column (word), the log of total documents over the
  # number of documents in which the word appears at least once
  idf = apply(mat, 2, function(x) log(length(x) / sum(x > 0)))
  # scale every row by the idf weights; t() restores the original
  # orientation, because apply over rows returns its results as columns
  mat = t(apply(mat, 1, function(x) x * idf))
  return(mat)
}
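For example, on a toy document-word matrix (invented counts, for illustration only):

m = matrix(c(1, 1,
             4, 0,
             0, 2),
           nrow = 2,
           dimnames = list(c("doc1", "doc2"),
                           c("the", "paris", "london")))
tf_idf(m)   # "the" appears in every document, so its column zeroes out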



