wordcloud.mappeR
Gabriel da Silva Zech, Julian Kath and Lorenzo Gini
A package for creating wordcloud maps in R
wordcloud.mappeR is an R package that allows one to create wordclouds shaped like the regions of a map. Such visualisations are especially useful for communicating datasets that consist of many different variables, where each variable is attributed to a specific region and a size of occurrence. Take the example below: a dataset containing the names of the 100 biggest companies (in terms of estimated number of employees) for each region in Germany and Italy.
The classification of regions used here follows the European Union’s Nomenclature of Territorial Units for Statistics (NUTS), a geocode standard for referencing the subdivisions of countries. The advantage of using this system is that the classification of regions across countries is standardised and hierarchically structured. For instance, Germany has the base code DE (NUTS 0), the state of Bavaria has the code DE2 (NUTS 1), its subregion of Oberbayern has the code DE21 (NUTS 2) and the city of Munich has the code DE212 (NUTS 3). Since each region is given a unique identifier which is directly linked to the regional level above it, it is fairly easy to identify and match any dataset to these regions.
However, this means that this package currently only works for creating wordcloud maps for EU countries. For an overview of the NUTS regions and levels, you can browse the available maps for each EU country or use this interactive map instead. If you have a dataset containing postcodes and want to convert these to NUTS regions, you can find the correspondence tables here.
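Because NUTS codes are hierarchical by construction, matching a region to the levels above it reduces to simple prefix truncation. A minimal sketch in base R (the nuts_parent helper is our own illustration, not part of the package):

```r
# NUTS codes embed every level above them: a country code is 2 characters
# (NUTS 0), and each lower level appends one more character, e.g.
# DE -> DE2 (Bavaria) -> DE21 (Oberbayern) -> DE212 (Munich).

# Roll a NUTS code up to a given level by truncating it
nuts_parent <- function(code, level) substr(code, 1, 2 + level)

nuts_parent("DE212", 2)  # "DE21" (Oberbayern)
nuts_parent("DE212", 0)  # "DE"   (Germany)
```

This prefix structure is what makes it straightforward to aggregate or match a dataset at any regional level.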
Installing wordcloud.mappeR

Currently, the package can only be installed from GitHub. We plan to publish it to CRAN soon.
# install from GitHub (requires the devtools package)
devtools::install_github("GabZech/wordcloud.mappeR")
# load the package
library(wordcloud.mappeR)
The input data must be in the format of a table (i.e. a data frame or tibble) containing three columns with the following data types:

- a character column containing the words to be plotted,
- a numeric column containing the frequency (or size) value of each word,
- a character column containing the NUTS code of the region each word belongs to.

Therefore, this is the minimal structure that the input data requires:
## words frequencies nuts_codes
## 1 <NA> NA <NA>
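As a concrete sketch, a table with this structure can be built in base R (the column names are free; they are later passed to the plotting function by name):

```r
# minimal input table: a character column of words, a numeric column of
# frequencies, and a character column of NUTS codes (toy values)
input <- data.frame(
  words       = c("mahle", "hugo boss"),
  frequencies = c(6136, 4165),
  nuts_codes  = c("DE1", "DE1"),
  stringsAsFactors = FALSE
)
str(input)
```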
There are currently two datasets included in the package which we’ve obtained by transforming parts of the 2019 Global Company Dataset made freely available by People Data Labs here. The original dataset contains an estimation of the number of employees in 2019 for over 7 million companies around the world. From this, we produced the following subsets:
- companies_DEU contains the 100 companies with the largest estimated number of employees for each state (NUTS 1) in Germany.
- companies_ITA contains the same type of data, but for the regions (NUTS 2) of Italy.

These can be loaded simply by calling data("name_of_dataset") after loading the package into your R environment.
data("companies_DEU")
companies_DEU
## # A tibble: 1,600 x 3
## name employees code
## <chr> <dbl> <chr>
## 1 mahle 6136 DE1
## 2 hugo boss 4165 DE1
## 3 lidl 3603 DE1
## 4 festo 3246 DE1
## 5 gft group 2772 DE1
## 6 m+w group 2376 DE1
## 7 mann+hummel group 1982 DE1
## 8 maquet getinge group 1831 DE1
## 9 sick 1683 DE1
## 10 heidelberg 1546 DE1
## # ... with 1,590 more rows
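For reference, a "top N per region" subset of this kind can be derived from any raw table with base R alone. A hedged sketch using a toy data frame (not the actual Global Company Dataset, and not the code we used to build the bundled datasets):

```r
# toy raw data: company name, estimated employees, NUTS-1 code
raw <- data.frame(
  name      = c("a", "b", "c", "d", "e"),
  employees = c(50, 400, 120, 900, 70),
  code      = c("DE1", "DE1", "DE1", "DE2", "DE2"),
  stringsAsFactors = FALSE
)

# keep the n companies with the most employees within each region
top_n_per_region <- function(df, n) {
  do.call(rbind, lapply(split(df, df$code), function(g) {
    head(g[order(-g$employees), ], n)
  }))
}

top2 <- top_n_per_region(raw, 2)
```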
Note: some companies might be attributed to the wrong location in these datasets. This can happen because there were some mistakes and inconsistencies in the stated location of each company in the original data. Nevertheless, these datasets are only meant to serve as examples for producing working wordcloud maps, so we have not tried to identify and correct any of these possible misattributions.
wordcloud_map()
The main function of the wordcloud.mappeR
package is
wordcloud_map()
. This is the function that takes in the input data
and plots the wordcloud map according to the parameters defined by the
user. These are the arguments that the function requires and their
specifications:
wordcloud.mappeR::wordcloud_map(dataframe,
country,
level_nuts,
name_column_words,
name_column_frequency,
name_column_nuts,
rm_outside = TRUE,
scale = "10",
png_path = "False")
For example, to reproduce the wordcloud maps shown at the top of this
page, you can pass the following values to the wordcloud_map()
function:
# Wordcloud map for Germany NUTS 1
wordcloud_map(companies_DEU, "DEU", 1, "name", "employees", "code")
# Wordcloud map for Italy NUTS 2
wordcloud_map(companies_ITA, "ITA", 2, "name", "employees", "code")
The rm_outside argument

The rm_outside argument is inherited from the ggwordcloud package, which is used to generate the wordclouds here. It determines whether to remove words that could not be fitted into the given wordcloud area. When set to FALSE, it stacks all such words on top of each other at the centre of each region. For example, this is how the previous plots of Germany and Italy look when rm_outside = FALSE:
It is not always an issue when words do not fit the wordcloud and are removed. Words are plotted in order of descending frequency (i.e. the most frequent words are plotted first), so when a dataset has too many words, the least important ones (i.e. those with the lowest frequencies) are the ones that get removed once there is no more space left in the given area.
Nevertheless, there are cases in which important words might be removed against your wish. Here are a few reasons why some words may not fit the wordcloud:
A way to fix, or at least mitigate, this issue is to tweak the max_word_size argument (see below). We recommend setting rm_outside = FALSE to see which words are not fitting, and then decreasing max_word_size until you are happy with the result.
The max_word_size argument

The max_word_size argument defines the maximum allowed size for the words plotted in the wordcloud. The minimum value is 1, where all words are equally sized, independent of their frequency values. The default value is 4, but you might want to try increasing or decreasing this value.
Increasing this number will make words with higher frequency values stand out more clearly from smaller ones. However, if these words are too big, they will not fit the wordcloud shape, so they will either be plotted on top of other words (if rm_outside = FALSE) or be removed from the wordcloud entirely (if rm_outside = TRUE).
The scale argument

The scale argument refers to the scale used for the regions' polygon shapes. What matters here is whether a smaller or larger scale is selected, as they have inverse effects on the process and output:

- Smaller scale values (e.g. "03") mean more detailed polygon shapes.
- Larger scale values (e.g. "60") mean less detailed and more "blocky-looking" shapes.

Here is an example of how the polygon shape of a region changes according to each scale (credit: giscoR):