verbatim: Read or Write Fixed-Column-Width Community Tables From or To...
In vegsoup: Classes and Methods for Phytosociology


The function read.verbatim(file) reads data from text files laid out like published tables typeset in monospaced fonts. Its main purpose is to help digitize literature data. The corresponding write.verbatim(obj) method performs the reverse operation on objects inheriting from class Vegsoup.

read.verbatim(file, colnames, layers, replace = c("|", "-", "_"),
              species.only = FALSE, vertical = TRUE, verbose = FALSE)

read.verbatim.append(x, file, mode = c("plots", "species", "layers"),
                     collapse = ",", abundance)

castFooter(file, schema = c(":", ",", " "), species.first = FALSE,
           abundance.first = TRUE, multiple = TRUE, abundance = "+", layers)

header(x)

## S4 method for signature 'Vegsoup'
write.verbatim(obj, file, select, absence = ".",
          sep = " ", pad = 1, abbreviate = TRUE, short.names = FALSE,
          rule, add.lines = FALSE, latex.input = FALSE, table.nr = FALSE)

## S4 method for signature 'VegsoupPartition'
write.verbatim(obj, file, select, absence = ".",
          sep = " ", pad = 1, abbreviate = TRUE, short.names = FALSE,
          rule, add.lines = FALSE, latex.input = FALSE, table.nr = FALSE)

## Arguments to functions

file

character. Path to a plain text file (text array).

## Arguments to function read.verbatim

`colnames`	character. String to be searched for in the header section and used to assign plot names. Note, do not use any characters defined for argument `replace` in the header section of the input file! See ‘Details’.
`layers`	list or character. If `layers` is a named list, it defines start and end indices to assign a layer to each species observation. `layers` can also be of mode character. If the length of the supplied vector agrees with the number of species in the data set, it assigns a value of layer to each species observation. If layers are present in the input file, argument layers can be a single character used to prune layers from taxa abbreviations. See ‘Examples’. For `read.verbatim`, separator used for trailing layer value (e.g. genus species@hl). Note, all layer assignment characters have to align vertically!
`replace`	character. Vector of characters to be replaced by blanks. Note, both the hyphen and dash characters are included in the default set. See ‘Details’.
`schema`	character. Vector of characters of length 3 to be used to split character strings. Element one gives the separator to prune the plot identifier or a species. Element two is the separator in the taxon and abundance value part. Element three is used to prune abundance values associated with each taxon or plot. See ‘Details’.
`species.first`	logical. If `TRUE` each row consists of a species occurring at one or several plots. If `FALSE` each row gives a plot with its species occurrences (and abundances).
`abundance.first`	logical. If `TRUE` abundance values precede taxon names, if `FALSE` abundance values are given after taxon names. If `NA` no abundance is available and the value of argument `abundance` is assigned.
`multiple`	logical. If `TRUE` there are several observations per species or plot on each line, separated by `schema[ 2 ]`. If `FALSE` there is a single observation per line; `schema[ 3 ]` specifies the split and `schema[ 2 ]` is not used at all.
`species.only`	logical. Just return a vector of species extracted from the data set.
`vertical`	logical. Set to `FALSE` if numbers are not stacked vertically.
`verbose`	Print diagnostic messages.

## Arguments to function read.verbatim.append

`x`	`VegsoupVerbatim` object. Usually created by calls to function `read.verbatim`.
`mode`	character. Defines how the species occurrences in rows are organized. See ‘Details’.
`collapse`	character. Separator between successive species. Defaults to `","`.
`abundance`	For `mode = "species"` a single `character` recycled for all observations and setting a dummy cover value if no abundance for a species is supplied in the data source. Defaults to `"+"`. For `mode = "plots"` either `TRUE` or `FALSE` and specifying if the character found after the species name should be treated as a cover value. See ‘Details’. If `abundance = FALSE` the defaults of `mode = "species"` apply with no option to override it, but setting `abundance` to any value (e.g. `"+"`) will use that value.

## Arguments for S4 method write.verbatim for signature 'Vegsoup' and 'VegsoupPartition'

`obj`	`Vegsoup` object.
`select`	character. Vector matching `names(obj)` or indices to columns in `sites(obj)` to select information to be incorporated in the header section. Beware, this only makes sense for numeric (e.g. elevation) or short string variables (e.g. slope aspect)!
`absence`	Character to code absences, defaults to `"."`.
`sep`	Character to be used as separator in columns of abundances (plots).
`pad`	Integer specifying the number of blanks to add to taxon names (right side) and layers string (both sides), defaults to 1.
`abbreviate`	Truncate abundance values to width 1 using `abbreviate(x, minlength = 1, strict = TRUE)`.
`rule`	Integer vector of length equal to the number of plots `nrow(obj)`. The lengths of runs of equal values in this vector are used to insert vertical rules, mimicked by the pipe glyph (`'|'`).
`short.names`	Use taxon abbreviation instead of long scientific names. Dots are converted to blanks.
`add.lines`	Add blank lines separating header section.
`latex.input`	Wrap output in a LaTeX verbatim environment.
`table.nr`	Create a short running number.
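
The `rule` argument is described in terms of run lengths. As a minimal base R sketch of that idea, the snippet below uses a made-up grouping vector for eight hypothetical plots and derives the plot positions after which a vertical rule would fall. It only illustrates the run-length logic and is not the package's internal code.

```r
# hypothetical group membership for eight plots, already ordered by group
rule <- c(1, 1, 1, 2, 2, 3, 3, 3)

# lengths of runs of equal values
runs <- rle(rule)$lengths

# a vertical rule would follow the last plot of every run except the final one
rule_after <- head(cumsum(runs), -1)
rule_after
```

Because only runs of equal values matter, plots belonging to the same group need to be adjacent for the rules to separate groups cleanly.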

The data format of the input file is fundamentally the same as the output format of traditional software such as ‘TWINSPAN’, and users are assumed to be familiar with such data structures. Familiarity with editing plain text files in which matrix-like data structures are laid out using monospaced fonts is also presupposed. To create a fully valid input file, a text editor that can display hidden characters is essential. At the very least, the editor must be capable of displaying line-end characters (see below).

In general, the data layout consists of species abundances measured on plots, where species performance is coded as a single character only. Species or variables are in rows and plots are in columns. As a consequence, in a monospaced font each observation on a plot aligns to a column composed of single characters. Only in the header section (see 1. below) are values wider than one character supported. Following the same logic, these figures have to be aligned vertically. For example, a value of ‘1000’ needs four lines, one line for each digit. See ‘Examples’ for a valid input file.
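
To make the vertical alignment of header values concrete, here is a small base R sketch that decodes vertically stacked digits. The header lines, the label `Elevation` and the margin width of 10 characters are invented for illustration; the code shows the general idea only and is not how read.verbatim() is implemented.

```r
# four hypothetical header lines: the label sits in the left margin and the
# digits of each plot's value are stacked vertically, one plot per column
head_lines <- c("Elevation 221",
                "          139",
                "          508",
                "          000")

# drop the 10-character left margin, keeping one character per plot
body <- substring(head_lines, 11, 13)

# rows are text lines, columns are plots
digits <- do.call(rbind, strsplit(body, ""))

# paste each column top to bottom to recover the value for that plot
elevation <- as.numeric(apply(digits, 2, paste, collapse = ""))
elevation
```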

Horizontal and vertical table rules are often found in printed tables. When digitizing such a table it is often handy to use some special characters to align with the original table strokes. Because they carry no information, these characters are discarded when parsing the input file: all characters supplied with replace are replaced with blanks. Care has to be taken with hyphens and dashes (see defaults of replace). R treats the minus sign as a dash ("\uad"). Warning, problems might arise on non-UTF-8 platforms.
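
The effect of replacing rule characters with blanks can be sketched in base R as follows. The table line is made up, and the loop is an illustration of the principle under the default character set, not the package's own code. The point is that every rule character becomes exactly one blank, so the column alignment is untouched.

```r
# default set of characters to blank out (taken from the usage above)
replace <- c("|", "-", "_")

# a made-up table line with vertical rules copied from a printed table
line <- "Carex curvula   | 2 1 . | + . 3 |"

cleaned <- line
for (ch in replace) {
  # fixed = TRUE treats "|" literally instead of as a regex alternation
  cleaned <- gsub(ch, " ", cleaned, fixed = TRUE)
}

# the width is unchanged, so columns still align
nchar(cleaned) == nchar(line)
```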

The S3 function read.verbatim() makes the following additional assumptions about the input file format (not a file format in the strict sense, but a set of conventions).

1. The table header and table body must each be enclosed within a pair of keywords. Lines giving the keywords ‘BEGIN HEAD’ and ‘END HEAD’ must be present at the beginning and end, respectively, of the table header. Likewise, the table body has to be enclosed with ‘BEGIN TABLE’ and ‘END TABLE’ to identify the main table structure.
2. The data blocks (HEAD and TABLE) can have empty lines and/or columns of spaces separating plots.
3. The width of the table (number of monospaced font characters) should align at the right side. It is crucial to ensure that all line-end characters align vertically!
4. All species absences have to be coded with a dot (‘.’). This is often found in published tables. When digitizing, using dots for absences also helps in aligning columns.
5. Given that absences are coded as ‘.’, each cell of the community table has a value. This also ensures that the left margin of the community table can be filtered automatically.
6. The width (number of characters) of the header and the table body data blocks has to match perfectly. The values found in the header (possibly longer than one character and aligned vertically) correspond exactly to the species abundances on the same plot. In other words, they align to the same column in a monospaced font layout.
7. Tab characters ("\t") are not allowed in the input file!
8. If layer assignment is supplied, the character(s) have to align vertically.
9. Empty lines, if present, are not allowed to have as many spaces as the whole text block.
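
Some of the conventions above can be checked before calling read.verbatim(). The following base R sketch runs three such checks on a tiny, made-up input held in a character vector (in practice it would come from `readLines(file)`). It assumes the header and body blocks are delimited separately, and all object names are hypothetical. read.verbatim() performs its own validation, so this is only an illustration.

```r
# a tiny, made-up input file as a character vector
txt <- c("BEGIN HEAD",
         "Plot      123",
         "END HEAD",
         "BEGIN TABLE",
         "Carex cur 2.1",
         "Poa alpin .+.",
         "END TABLE")

# tab characters are not allowed
no_tabs <- !any(grepl("\t", txt, fixed = TRUE))

# all four keywords must be present and in the expected order
keywords <- c("BEGIN HEAD", "END HEAD", "BEGIN TABLE", "END TABLE")
pos <- match(keywords, txt)
has_keywords <- !anyNA(pos) && !is.unsorted(pos)

# header and body lines must have the same width (empty lines are ignored)
head_block  <- txt[(pos[1] + 1):(pos[2] - 1)]
table_block <- txt[(pos[3] + 1):(pos[4] - 1)]
widths <- nchar(c(head_block, table_block))
equal_width <- length(unique(widths[widths > 0])) == 1

c(no_tabs = no_tabs, has_keywords = has_keywords, equal_width = equal_width)
```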

Plot identifiers can be assigned using one of the attributes on the resulting object or by supplying the argument colnames. Function header can be used to retrieve the header part, which is stored as additional attributes of the matrix object.

It is often the case that rare species are only given in the table footer but not in the main table. The function read.verbatim.append() takes an object created by calls to read.verbatim() and adds species given in argument file. The function currently accepts two formats specified by argument mode. Input files for mode = "species" are simple text files, where each row corresponds to a unique species, followed by a colon (‘:’) and subsequent strings matching colnames(x), separated with commas (see argument collapse). All spaces found after the colon will be discarded. See ‘Examples’ for a valid input file in species mode. Files applicable to mode = "plots" have plots (must match colnames(x)) in each row, followed by a colon. Note, if the plot identifier can be interpreted as numeric, leading zeros are stripped off. This is also the behavior of read.verbatim. The part after the colon has to be an enumeration of taxa for the respective plot, again separated by argument collapse. If argument abundance is TRUE the single last character after the species name is treated as an abundance value. mode = "layers" also allows a layer to be parsed.
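
The species-mode layout described above (species, colon, comma-separated plot identifiers) can be unpacked with a few lines of base R. The two footer lines and the dummy abundance "+" are made up for illustration, and the sketch does not reproduce what read.verbatim.append() does internally, such as matching against colnames(x).

```r
# two made-up footer lines in "species" layout
footer <- c("Arabis ciliata: 003, 012",
            "Luzula campestris: 007")

# split species from the list of plots at the colon
parts <- strsplit(footer, ":", fixed = TRUE)
species <- trimws(vapply(parts, `[`, character(1), 1))

# split the plot list at the commas and discard surrounding spaces
plots <- lapply(parts, function(p) {
  trimws(strsplit(p[2], ",", fixed = TRUE)[[1]])
})

# one row per species occurrence, with a dummy abundance value
long <- data.frame(plot = unlist(plots),
                   species = rep(species, lengths(plots)),
                   abundance = "+",
                   stringsAsFactors = FALSE)
long
```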

Function castFooter is a stand-alone utility function to cast a three column matrix from plain text consisting of (1) plot, (2) species and (3) abundance values character strings given on a single row of the input file. Suppose the following example: "30: 1 Empetrum nigrum ssp. hermaphroditum, + Arabis ciliata, + Luzula campestris". In this case, 30: gives the plot identifier, it is separated with a colon from the listing of species with their associated abundance values. Each species and cover combination is further separated with a comma. Finally, the abundance value can be unscrambled by searching for the first space. The schema argument must be used to define the particular patterning of the input file. For this particular case it is c(":", ",", " "). Unfortunately there are no standards and there is an overwhelming number of variants in the literature. The main practical effect of the castFooter function is to transpose the data to a matrix representation for further manipulation.
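
The three-step split described for the example string can be traced in base R. This is a hedged sketch of the logic only: the variable names are invented, and castFooter itself reads from a file and may handle many more variants.

```r
x <- "30: 1 Empetrum nigrum ssp. hermaphroditum, + Arabis ciliata, + Luzula campestris"
schema <- c(":", ",", " ")

# element one: the first colon separates the plot identifier from the listing
i <- as.integer(regexpr(schema[1], x, fixed = TRUE))
plot <- substring(x, 1, i - 1)
rest <- substring(x, i + 1)

# element two: commas separate the species and abundance combinations
items <- trimws(strsplit(rest, schema[2], fixed = TRUE)[[1]])

# element three: the first space separates abundance from taxon name
j <- as.integer(regexpr(schema[3], items, fixed = TRUE))
abundance <- substring(items, 1, j - 1)
taxon <- substring(items, j + 1)

# three-column character matrix; the single plot value is recycled
m <- cbind(plot = plot, species = taxon, abundance = abundance)
m
```

Splitting at the first space only is what keeps multi-word names such as "Empetrum nigrum ssp. hermaphroditum" intact.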

Finally, the S4 method write.verbatim() creates output honoring the set of definitions given above. See ‘Arguments’ for possible customization of the output file. If called on an object of class "VegsoupPartition", argument rule is taken from partitioning(obj). To get partitions aligned properly it might be necessary to apply seriation(obj) beforehand.

The print method for objects of class VegsoupVerbatim uses as.data.frame(x) as a means to get rid of quotes and to provide clean screen output.

read.verbatim and read.verbatim.append return an S3 object of class VegsoupVerbatim. This is basically a matrix of mode character with attributes giving the data in the header section of the input file. Species are rownames. If argument colnames is supplied, the returned matrix will hold meaningful colnames.

write.verbatim writes a file to disk and invisibly returns the vector of characters written to the file.

castFooter returns an object of class Species.

It is hard to avoid typos when editing monospaced table structures, especially when there are a lot of rows and columns. The human eye is easily overwhelmed by the sheer number of values. The read.verbatim function will report any instance that is not valid, and diagnostic messages are printed to the console to aid the user in correcting the input file. The probability of typos increases if there are columns of spaces separating columns of data.

Leading zeros are not preserved: colnames that can be coerced with as.numeric, as well as the header block as a whole, are subject to a call to type.convert!
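
The loss of leading zeros is easy to reproduce with base R alone. The plot identifiers below are made up. Padding them back with sprintf() is one possible workaround, assuming the original width (here three digits) is known.

```r
# made-up plot identifiers with leading zeros
ids <- c("001", "002", "010")

# type.convert() turns them into integers, dropping the zeros
converted <- type.convert(ids, as.is = TRUE)
converted

# restore a fixed width of three digits
restored <- sprintf("%03d", converted)
restored
```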

Roland Kaiser

The demonstration data set used in the example is taken from:

Erschbamer, B. (1992). Zwei neue Gesellschaften mit Krummseggen (Carex curvula ssp. rosae, Carex curvula ssp. curvula) aus den Alpen – ein Beitrag zur Klärung eines alten ökologischen Rätsels. Phytocoenologia, 21:91-116.

There are four text files. "Erschbamer1992Tab4.txt" has the main vegetation table plus some header rows. Species names were taxonomically interpreted to match a reference list supplied as "Erschbamer1992Taxonomy.txt". "Erschbamer1992Tab4Tablefooter.txt" has a list of rare species, those with frequency lower than 4. "Erschbamer1992Tab4Locations.txt" has geographic coordinates interpreted from the data source.

stackSpecies, stackSites

file <- system.file("extdata", "Erschbamer1992Tab4.txt",
                    package = "vegsoup")

# read OCR scan
x1 <- read.verbatim(file, colnames = "Aufnahme Nr.", verbose = TRUE)
class(x1)
head(x1)
dim(x1)

# extract header (sites) data from VegsoupVerbatim object
y1 <- header(x1)
# translate and groom names
# header() also returns plot names as rownames
names(y1) <- c("plot", "altitude", "aspect", "slope", "cover", "pH", "block")
# promote to Sites object
y1 <- stackSites(y1, schema = "plot")
y1

# promote table body to Species object
x1 <- species(x1)
richness(x1)

# get species from table footer
# a listing of species not covered by the main table and the plots in which they occur
# the source does not supply any abundance values, we assume '+'
file <- system.file("extdata", "Erschbamer1992Tab4Tablefooter.txt",
                    package = "vegsoup")
x2 <- castFooter(file, species.first = TRUE, abundance.first = NA,
                 abundance = "+")
x2$plot <- sprintf("%03d", as.numeric(x2$plot))
richness(x2)
# bind species in table footer with main table
X <- bind(x1, x2)
X
richness(X)

# additional sites data including coordinates as a tab-delimited file
file <- system.file("extdata", "Erschbamer1992Tab4Locations.txt",
                    package = "vegsoup")
y2 <- read.delim(file, colClasses = "character")
head(y2)
# add leading zeros
y2$nr <- sprintf("%03d", as.numeric(y2$nr))
# promote to class "Sites"
y2 <- stackSites(y2, schema = "nr")
y2

# bind with sites data from table header
Y <- bind(y1, y2)

# taxonomic reference list
file <- system.file("extdata", "Erschbamer1992Taxonomy.txt",
                    package = "vegsoup")
Z <- read.delim(file, colClasses = "character")
# promote to class Taxonomy
Z <- taxonomy(Z)

# groom abundance scale codes to fit the standard
# of the extended Braun-Blanquet scale used in the original publication
X$cov <- gsub("m", "2m", X$cov)
X$cov <- gsub("a", "2a", X$cov)
X$cov <- gsub("b", "2b", X$cov)

# create Vegsoup object
( x <- Vegsoup(X, Y, Z, "braun.blanquet") )

# plot of the dissimilarity matrix
coldiss(x, diag = TRUE)

# see if we can reproduce the grouping in the original table
prt1 <- VegsoupPartition(x, 3, "wards")
prt0 <- VegsoupPartition(x, 3, clustering = "block")

# block/group 3 is ambiguously assigned
confusion(prt1, prt0)

# write object of class Vegsoup
txt <- write.verbatim(seriation(x), file = tempfile(),
                      select = c(8,1,3))
txt[1:30] # resize console window