knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The glottospace package facilitates the geospatial analysis of linguistic and cultural data. The aim of this package is to provide a streamlined workflow for working with spatio-linguistic data, including data import, cleaning, exploration, visualization and export. For example, with glottospace you can quickly match your own linguistic data to a location and plot it on a map. You can also calculate distances between languages based on their location or linguistic features and visualize those distances. In addition, with glottospace you can easily access global databases such as glottolog, WALS, Grambank, D-PLACE and Phoible from R and integrate them with your own data.
We're still actively developing the glottospace package by adding new functions and improving existing ones. Although the package is stable, you might find bugs or encounter things you might find confusing. You can help us to improve the package by:
If you find glottospace useful, please cite it in your work:
citation("glottospace")
The package uses four global databases: glottolog, WALS, Grambank and Phoible. In addition, glottospace builds on a combination of spatial and non-spatial packages, including sf, terra, tmap, mapview, rnaturalearth, and dplyr. If you use glottospace in one of your publications, please cite these data sources and packages as well.
You can install the latest release of glottospace from CRAN with:
# install.packages("glottospace") # If you receive the message 'loading failed for 'i386', you can try: # install.packages("glottospace", INSTALL_opts = "--no-multiarch")
You can install the development version of glottospace from GitHub with:
# install.packages("devtools") # devtools::install_github("glottospace/glottospace", INSTALL_opts=c("--no-multiarch"))
Before describing the functionality of glottospace, we give a quick demonstration of a typical workflow.
Imagine you're working with languages in a particular region, and want to visualize them on a map. With glottospace this is easy! You could for example filter all languages in South America, and show which ones of them are isolate languages:
library(glottospace) ## Plot point data: glottomap(continent = "South America", color = "isolate")
Languages are often represented with points, while in reality the speakers of a language can inhabit vast areas. glottospace works with point and polygon data. When polygon data is not available, you can interpolate the points and plot those.
## Filter by continent glottopoints <- glottofilter(continent = "South America") # Interpolate points to polygons: glottopols <- glottospace(glottopoints, method = "voronoi") # Plot polygon data: glottomap(glottodata = glottopols, color = "family_size_rank", palette = "tol.yl_or_br")
The glottospace package offers a wide range of functions to work with spatio-linguistic data. The functions are organized into the following function families, of which the core function generally has the same name as the family to which it belongs:
glottoget: download glottodata from remote server, or load locally stored glottodata.
glottocreate: create empty glottodata structure, to add data manually.
glottocheck: run interactive quality checks of user-provided glottodata.
glottoclean: clean-up glottodata.
glottojoin: join user-provided glottodata with other (often online) datasets.
glottosearch: search glottolog database for languages, language families, glottocodes, etc.
glottofilter: filter/subset glottodata based on linguistic and geographic features/variables.
glottodist: calculate differences/similarities between languages based on their features (linguistic, cultural, environmental, geographic, etc.).
glottodist_subdata: calculate differences/similarities between languages based on their constructions.
glottoplot: visualize differences/similarities between languages.
glottospace: make glottodata spatial, add coordinates, add spatial points or polygons to languages.
glottomap: visualize linguistic and cultural data on a map.
glottomap_rips_filt: visualize the Vietoris-Rips filtration.
glottomap_persist_diagram: create the persistence diagram with respect to geospatial points.
glottosave: save output generated by glottospace (data, figures, maps, etc.).
You can load locally stored glottodata (for example from an excel file or shapefile). The glottospace package has two built-in artificial demo datasets ("demodata" and "demosubdata").
glottodata <- glottoget("demodata") head(glottodata)
You can also load glottodata from online databases such as glottolog. You can download a raw version of the data ('glottolog'), or an enriched/boosted version ('glottobase'):
# To load glottobase: glottobase <- glottoget("glottobase") colnames(glottobase)
In the case of Phoible, you can load it in multiple different ways. When you set glottodata to be "phoible_raw" or "phoiblespace_raw", you can load the non-spatial/spatial-enhanced version of raw data of Phoible. In the case of spatial-enhanced version, the Parameter ID's that all values are "absent" are removed.
# To load phoible_raw: phoible_raw <- glottoget(glottodata = "phoible_raw", download = F) # To load phoiblespace_raw: phoiblespace_raw <- glottoget(glottodata = "phoiblespace_raw", download = F)
When glottodata = "phoible" or "phoiblespace", you randomly choose only one sample for all the duplicated glottocodes, and you can load the non-spatial/spatial-enhanced version of Phoible. If you fix the value of seed, you can make a reproducible output. In the case of spatial-enhanced version, the Parameter ID's that all values are "absent" are removed.
phoible <- glottoget(glottodata = "phoible", seed = 42) phoiblespace <- glottoget(glottodata = "phoiblespace", seed = 42)
When glottodata = "phoible_raw_param_sf", you can load an sf object of the geographical distribution for all parameter IDs in raw Phoible.
phoible_raw_param_sf <- glottoget(glottodata = "phoible_raw_param_sf") head(phoible_raw_param_sf)
When glottodata = "phoible_param_sf", the function first randomly choose only one sample for all the duplicated glottocodes, and then load the geographical distribution for all parameter IDs as an sf object. If you set the value of seed, you can then create a reproducible dataset.
phoible_param_sf <- glottoget(glottodata = "phoible_param_sf", seed = 42) head(phoible_param_sf)
You can generate empty data structures that help you to add your own data in a structured way. These data structures can be saved to your local folder by specifying a filename (not demonstrated here).
glottocreate(glottocodes = c("yucu1253", "tani1257"), variables = 3, meta = FALSE)
We've specified meta = FALSE, to indicate that we want to generate a 'flat' glottodata table. However, when creating glottodata, by default, several meta tables are included:
glottodata_meta <- glottocreate(glottocodes = c("yucu1253", "tani1257"), variables = 3) summary(glottodata_meta)
The majority of these meta tables are added for the convenience of the user. The 'structure' and 'sample' tables are the only ones that are required for some of the functions in the glottospace package. A structure table can also be added later:
glottodata <- glottoget("demodata", meta = FALSE) structure <- glottocreate_structuretable(varnames = c("var001", "var002", "var003")) glottodata <- glottocreate_addtable(glottodata, structure, name = "structure")
More complex glottodata structures can also be generated. For example, in cases where you want to distinguish between groups within each language.
# Instead of creating a single table for all languages, you might want to create a list of tables (one table for each language) glottocreate(glottocodes = c("yucu1253", "tani1257"), variables = 3, groups = c("a", "b"), n = 2, meta = FALSE)
If you have your own data, you might want to do some interactive quality checks:
glottodata <- glottoget("demodata") glottocheck(glottodata, diagnostic = FALSE)
We've now specified diagnostic = FALSE, but the default is to show some more extensive diagnostics (like a data coverage plot).
You can also check the metadata:
glottodata <- glottoget(glottodata = "demodata", meta = TRUE) glottocheck(glottodata, checkmeta = TRUE)
Once you've loaded glottodata, you might encounter some inconsistencies. For example, data-contributors might not have used a standardized way of coding missing values.
glottodata <- glottoget(glottodata = "demodata", meta = TRUE) glottodata_clean <- glottoclean(glottodata) glottodata$glottodata glottodata_clean$glottodata
Join user-provided glottodata with other datasets, or with online databases.
# Join with glottospace glottodata <- glottoget("demodata") # Add data from glottolog: glottojoin(glottodata, with = "glottobase") # Simplify glottosubdata (join a list of glottodata tables into a single table) glottosubdata <- glottocreate(glottocodes = c("yucu1253", "tani1257"), variables = 3, groups = c("a", "b"), n = 2, meta = FALSE) glottosimplify(glottodata = glottosubdata)
As demonstrated in the example above, you can search glottodata for a specific search term
You can search for a match in all columns:
glottosearch(search = "yurakar")
Or limit the search to specific columns:
glottosearch(search = "Yucuni", columns = c("name", "family"))
Sometimes you don't find a match:
glottosearch(search = "matsigenka")[,"name"]
If you can't find what you're looking for, you can increase the tolerance:
glottosearch(search = "matsigenka", tolerance = 0.2)[,"name"]
Aha! There it is: 'Machiguenga'
glottosearch(search = "matsigenka", tolerance = 0.4)[,"name"]
filter, select, query
eurasia <- glottofilter(continent = c("Europe", "Asia")) eurasia # Other examples of glottofilter: wari <- glottofilter(glottocode = "wari1268") indo_european <- glottofilter(family = 'Indo-European') south_america <- glottofilter(continent = "South America") colovenz <- glottofilter(country = c("Colombia", "Venezuela")) arawtuca <- glottofilter(expression = family %in% c("Arawakan", "Tucanoan"))
You can also interactively filter languages by drawing or clicking on a map:
# selected <- glottofiltermap(continent = "South America") # glottomap(selected)
Quantify differences or disimilarities between languages glottodistances: calculating disimilarities between languages based on linguistic/cultural features.
The glottodist
can compute two types of distance/dissimilarity by
setting the argument metric
. When metric="gower"
, it returns the
Gower distance, when metric="anderberg"
, it returns the Anderberg
dissimarity. The default value is metric="gower"
.
# In order to be able to calculate linguistic distances a structure table is required, that's why we specify meta = TRUE. In case you have glottodata without a structure table, you can add it (see examples in the glottocreate() section). glottodata <- glottoget("demodata", meta = TRUE) glottodist_gower <- glottodist(glottodata = glottodata, metric = "gower") glottodist_anderberg <- glottodist(glottodata = glottodata, metric = "anderberg")
You can also compute the differences or disimilarities between languages
based on constructions. The function glottodist_subdata
contains three
types of disimilarities, i.e., "Matching Constructions Index (MCI)",
"Relative Index (RI)", and "Form-Meaning Index (FMI)" based on either
gower distance or Anderberg distance.
For example, suppose if the language "tani1257" contains three constructions "tani1257_0001", "tani1257_0002", "tani1257_0003", and the language "yucu1253" contains two constructions "yucu1253_0001", "yucu1253_0002".
knitr::kable(glottospace::glottoget("demosubdata_cnstn")[["tani1257"]])
knitr::kable(glottospace::glottoget("demosubdata_cnstn")[["yucu1253"]])
The Matching Constructions Index of tani1257 and yucu1253 w.r.t. gower distance is given by
glottoget(glottodata = "demosubdata_cnstn") |> glottodist_subdata(metric = "gower", index_type = "mci")
If the meaning features are "var001", "var002", "var003", "var004", and the form features are "var005", "var006", "var007", the Atomic Meaning Index of tani1257 and yucu1253 w.r.t. gower distance is given by
glottoget(glottodata = "demosubdata_cnstn") |> glottodist_subdata(metric = "gower", index_type = "ri", avg_idx = 1:4, fixed_idx = 5:7)
and the Atomic Form Index of tani1257 and yucu1253 w.r.t. gower distance is given by
glottoget(glottodata = "demosubdata_cnstn") |> glottodist_subdata(metric = "gower", index_type = "ri", avg_idx = 5:7, fixed_idx = 1:4)
and the Form-Meaning Index is given by
glottoget(glottodata = "demosubdata_cnstn") |> glottodist_subdata(index_type = "fmi", avg_idx = 1:4, fixed_idx = 5:7)
Visualizing differences (distances) between languages based on linguistic, cultural, and environmental features.
glottodata <- glottoget("demodata", meta = TRUE) glottodist <- glottodist(glottodata = glottodata) glottoplot(glottodist = glottodist)
This family of functions turns glottodata into a spatial object. As we've illustrated above, these can be either glottopoints or glottopols
glottodata <- glottoget("demodata") glottospacedata <- glottospace(glottodata, method = "buffer", radius = 5) # By default, the projection of maps is equal area, and shape is not preserved: glottomap(glottospacedata)
With glottomap you can quickly visualize the location of languages. Below we show simple static maps, but you can also create dynamic maps by specifying type = "dynamic".
To select languages, you don't need to call glottofilter() first, but you can use glottomap() directly. Behind the scenes glottomap() passes those arguments on to glottofilter().
glottomap(country = "Colombia")
However, you can also create maps with other glottodata. For example, we might want to create a world map highlighting the largest language families
glottodata <- glottoget() families <- dplyr::count(glottodata, family, sort = TRUE) # highlight 5 largest families: glottodata <- glottospotlight(glottodata = glottodata, spotcol = "family", spotlight = families$family[1:5], spotcontrast = "family") # Create map glottomap(glottodata, color = "legend", glotto_title = "Top 5 largest languages families", ptsize=.3, palette = "Set 1", alpha = .8)
You can also produce maps based on either countries or hydro-basins. The hydro-basin basemap is cited from HydroSHEDS of Level 03.
par(mar = c(2, 2, .1, .1)) ## Filter by continent glottopoints <- glottofilter(continent = "South America") ## Sort the families families <- dplyr::count(glottopoints, family, sort = TRUE) ## Plot the top 5 languages based on countries glottomap(glottopoints[glottopoints$family %in% families$family[1:5], ], basemap = "country", color = "family", palette = "paired", alpha = .9) ## Plot the top 5 languages based on hydro-basins glottomap(glottopoints[glottopoints$family %in% families$family[1:5], ], basemap = "hydro-basin", color = "family", palette = "paired", alpha = .9)
You can visualize the topological pattern of spatial distribution of
languages using function glottomap_rips_filt()
. More technically, this
method of visualization is referred to as "Vietoris-Rips
filtration".
In the function glottomap_rips_filt()
, the default units of r
and
maxscale
are both 100km, corresponding to the buffer radius of each
language and the maximum value of the Vietoris-Rips filtration
respectively.
For example, if we filter out the languages of Arawakan family, and set the buffer radius of each language as 600 km, the maximum value of the Vietoris-Rips filtration as 1500 km, we can see that there is a circluar structure around the borders of Colombia, Peru and Brazil:
# Pick up all the Arawakan languages glottopoints <- glottofilter(continent = "South America") awk <- glottopoints[glottopoints$family == "Arawakan", ] # Create a static Vietoris-Rips filtration map glottomap_rips_filt(glottodata = awk, r = 6, maxscale = 15, is_animate = FALSE)
If you set the the argument is_animate = TRUE
, you can then generate a
GIF file, here the buffer radius ranges from 0 km to 800 km:
glottopoints <- glottofilter(continent = "South America") # Pick up all the Arawakan languages awk <- glottopoints[glottopoints$family == "Arawakan", ] # Create a GIF file of Vietoris-Rips filtration map glottomap_rips_filt(glottodata = awk, r = 8, maxscale = 15, is_animate = TRUE, length.out = 30, movie.name = "filtration.gif")
The persistent topological structure (clusters and circular structures) can be described by the so-call "persistence diagram", you can see that there is a unique point of dimension 1 standing out of the diagonal line when "Birth" is around 600 km and "Death" is around 800 km, this is consistent with the precious visualization of Vietoris-Rips filtration:
# Pick up all the Arawakan languages glottopoints <- glottofilter(continent = "South America") awk <- glottopoints[glottopoints$family == "Arawakan", ] # Create a static Vietoris-Rips filtration map glottomap_persist_diagram(awk, maxscale = 15)
All output generated with the glottospace package (data, figures, maps, etc.) can be saved with a single command.
glottodata <- glottoget("demodata", meta = FALSE) # Saves as .xlsx # glottosave(glottodata, filename = "glottodata") # Saves as .GPKG glottospacedata <- glottospace(glottodata) # glottosave(glottospacedata, filename = "glottospacedata") # By default, static maps are saved as .png, dynamic maps are saved as .html glottomap <- glottomap(glottodata) # glottosave(glottomap, filename = "glottomap")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.