defined: Semantically Enriched Vectors

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The dataset package extends R's native data structures with machine-readable metadata. It follows a semantic early-binding approach, which means metadata is embedded as soon as the data is created, making datasets suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

defined works naturally with data structured according to tidy data principles (Wickham, 2014), where each variable is a column, each observation is a row, and each type of observational unit forms a table. It adds a semantic layer to individual vectors so their meaning is explicit, consistent, and machine-readable.

This vignette focuses specifically on the defined function, which you can use to create a semantically enriched vector. For details on semantically enriched data frames, see vignette("dataset_df", package = "dataset").

Purpose

The defined() function helps you create semantically rich labelled vectors that are easier to:

By attaching metadata at creation time, defined prevents the loss of context and meaning that often occurs when data is exchanged or archived. This approach supports the FAIR data principles (Findable, Accessible, Interoperable, Reusable) and facilitates integration into semantic web systems.

Getting started

library(dataset)
data("gdp")

We’ll start by wrapping a numeric GDP vector using defined().

gdp_1 <- defined(
  gdp$gdp,
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

The defined() class builds on labelled vectors by adding rich metadata: a variable label, a unit of measurement, a concept definition, and an optional namespace.

This is particularly useful for reproducible research, standard-compliant data, or long-term interoperability. The class is implemented with R’s attributes() function, which guarantees wide compatibility. A defined vector can be used even in base R.

attributes(gdp_1)

The output shows that the actual S3 class is called haven_labelled_defined, a name that reflects its inheritance from haven_labelled (see labelled::labelled). In the dataset summary headers, the abbreviation <defined> is used instead.
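
To see why plain attributes are enough to carry this kind of metadata, here is a minimal base R sketch. It does not use the dataset package: the class name toy_defined and the values are illustrative assumptions, not the package's internals.

```r
# A toy stand-in for a defined vector, built only with base R attributes.
# "toy_defined" is a hypothetical class name used for illustration.
toy <- structure(
  c(2523.6, 2725.8, 3013.2),
  label = "Gross Domestic Product",
  unit  = "CP_MEUR",
  class = "toy_defined"
)

# The metadata is stored next to the values and can be read back.
attr(toy, "label")
attr(toy, "unit")

# Code that ignores the attributes still sees ordinary numbers.
mean(unclass(toy))
```

Because the metadata lives in ordinary attributes, any R code that can read attributes can read it, even without the package loaded.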

Use the var_label(), var_unit() and var_concept() helper functions to set or retrieve metadata individually.

cat("Get the label only: ", var_label(gdp_1), "\n")
cat("Get the unit only: ", var_unit(gdp_1), "\n")
cat("Get the concept definition only: ", var_concept(gdp_1), "\n")
cat("All attributes:\n")
str(attributes(gdp_1))
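
The get/set pattern behind such helpers can be sketched with a base R accessor and a matching replacement function. The names toy_unit() and `toy_unit<-` are hypothetical and only illustrate the mechanism; they are not part of the dataset package.

```r
# Hypothetical accessor pair mimicking the get/set pattern with base R.
toy_unit <- function(x) attr(x, "unit", exact = TRUE)

`toy_unit<-` <- function(x, value) {
  attr(x, "unit") <- value
  x
}

v <- c(2523.6, 2725.8, 3013.2)
toy_unit(v)               # no unit has been set yet
toy_unit(v) <- "CP_MEUR"  # calls `toy_unit<-` and reassigns v
toy_unit(v)
```

A replacement function must return the modified object, which is why the setter ends with x.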

Printing and summary

The most frequently used vector methods, such as print() and summary(), are implemented as expected:

print(gdp_1)
summary(gdp_1)

Handling ambiguity

If you try to concatenate a semantically under-specified vector to an existing defined vector, you will get an error indicating that some attributes are not compatible. This is by design: it prevents combining values that differ in meaning, such as GDP figures expressed in different currencies.

gdp_2 <- defined(
  c(2523.6, 2725.8, 3013.2),
  label = "Gross Domestic Product"
)

In the following example, gdp_1 and gdp_2 are not defined with the same level of precision.

c(gdp_1, gdp_2)
Error in vec_c():
! Can't combine ..1 <haven_labelled_defined> and ..2 <haven_labelled_defined>.
✖ Some attributes are incompatible.

To resolve this, you can add the missing attributes so that the vectors are semantically compatible.

Let's define the GDP of the Faroe Islands more precisely:

var_unit(gdp_2) <- "CP_MEUR"
var_concept(gdp_2) <- "http://data.europa.eu/83i/aa/GDP"

Once the metadata matches, you can combine them.

new_gdp <- c(gdp_1, gdp_2)
summary(new_gdp)
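
The idea of refusing to combine vectors whose metadata disagree can be sketched in base R. The helper combine_checked() below is hypothetical and much simpler than the vec_c() machinery named in the error message above; it only compares three attributes for exact equality.

```r
# Hypothetical helper: combine two vectors only if their metadata agree.
combine_checked <- function(a, b, fields = c("label", "unit", "concept")) {
  for (f in fields) {
    if (!identical(attr(a, f, exact = TRUE), attr(b, f, exact = TRUE))) {
      stop("Cannot combine: attribute '", f, "' is incompatible.", call. = FALSE)
    }
  }
  out <- c(as.vector(a), as.vector(b))
  # Copy the agreed metadata onto the combined vector.
  for (f in fields) {
    attr(out, f) <- attr(a, f, exact = TRUE)
  }
  out
}

a <- structure(c(1, 2), label = "GDP", unit = "CP_MEUR")
b <- structure(c(3, 4), label = "GDP")  # unit is missing

# The first attempt fails because the units differ.
res <- tryCatch(combine_checked(a, b), error = function(e) conditionMessage(e))
res

# After adding the missing unit, the vectors combine.
attr(b, "unit") <- "CP_MEUR"
ab <- combine_checked(a, b)
attributes(ab)
```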

Using namespaces for coded values

You can also define variables that store codes (like country codes) with a namespace that points to a human- and machine-readable definition of those codes. In statistical datasets, such attribute columns describe characteristics of the observations or the measured variables.

country <- defined(
  c("AD", "LI", "SM"),
  label = "Country name",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)

For example, with the namespace definition above, the code AD points to https://www.geonames.org/countries/AD/.

You can get or set the namespace of a defined vector with var_namespace().

var_namespace(country)
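
A minimal sketch of how such a namespace template could be expanded into one URI per code is shown here. It assumes that $1 marks the position where each code is inserted; expand_namespace() is a hypothetical helper written in base R, not a function of the dataset package.

```r
# Assumption: "$1" in the namespace marks where each code is inserted.
expand_namespace <- function(codes, namespace) {
  vapply(
    codes,
    function(code) sub("$1", code, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

uris <- expand_namespace(
  c("AD", "LI", "SM"),
  "https://www.geonames.org/countries/$1/"
)
uris
```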

A URI such as http://publications.europa.eu/resource/authority/bna/c_6c2bb82d resolves to a machine-readable definition of geographical names.

The use of several defined vectors in a dataset_df object is explained in a separate vignette.

Basic usage

You can create defined vectors from character values as well as numeric values. Methods like as_character() and as_numeric() let you coerce back to base R types while controlling what happens to the metadata.

countries <- defined(
  c("AD", "LI"),
  label = "Country code",
  namespace = "https://www.geonames.org/countries/$1/"
)

countries
as_character(countries)

Subsetting and coercion

Subsetting a defined vector works like subsetting any other vector.

gdp_1[1:2]
gdp_1[gdp_1 > 5000]
as.vector(gdp_1)
as.list(gdp_1)
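
Base R subsetting normally drops custom attributes, so a metadata-carrying class needs its own `[` method to keep them. The sketch below shows the general technique with a hypothetical toy_defined class; it is an illustration under that assumption, not the package's implementation.

```r
plain <- structure(c(10, 20, 30), unit = "CP_MEUR")
attr(plain[1:2], "unit")  # NULL: base subsetting drops custom attributes

# Hypothetical class with a `[` method that copies the metadata across.
`[.toy_defined` <- function(x, i) {
  out <- unclass(x)[i]
  attr(out, "unit") <- attr(x, "unit", exact = TRUE)
  class(out) <- class(x)
  out
}

toy <- structure(c(10, 20, 30), unit = "CP_MEUR", class = "toy_defined")
kept <- toy[2:3]
attr(kept, "unit")
```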

Coerce to base R types

defined() vectors support a family of coercion helpers.
These methods avoid silent metadata loss and provide predictable conversions while respecting the underlying data type.

All coercion functions follow the same principles:

Below are the available coercion helpers.

Character coercion

as_character() converts the vector to a base R character vector.

If value labels are present, they become the character representation; otherwise, the underlying values are coerced.

as_character(country)
as_character(c(gdp_1, gdp_2))
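
The rule "use value labels if present, otherwise coerce the values" can be sketched in base R as follows. The sketch assumes value labels are stored as a named vector in a labels attribute, in the style of labelled vectors; toy_as_character() is a hypothetical helper, not the package function.

```r
# Hypothetical helper: prefer value labels, fall back to the raw values.
toy_as_character <- function(x) {
  labels <- attr(x, "labels", exact = TRUE)
  values <- as.vector(x)
  if (is.null(labels)) {
    return(as.character(values))
  }
  out <- names(labels)[match(values, labels)]
  # Values without a matching label keep their original representation.
  ifelse(is.na(out), as.character(values), out)
}

sex <- structure(c(0, 1, 1), labels = c(Female = 0, Male = 1))
toy_as_character(sex)

# Without value labels, the numbers themselves are converted.
toy_as_character(c(2523.6, 2725.8))
```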

Factor coercion

as_factor() converts the vector into a factor.

as_factor(country)

Numeric coercion

as_numeric() converts numeric defined vectors to base R numeric.

It throws an error if the underlying data is not numeric.

as_numeric(c(gdp_1, gdp_2))
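
A coercion helper of this kind typically checks the underlying type first and only then decides what to do with the metadata. The base R sketch below illustrates that pattern; toy_as_numeric() and its keep_metadata argument are hypothetical names and do not describe the signature of as_numeric().

```r
# Hypothetical helper; `keep_metadata` is an illustrative argument name.
toy_as_numeric <- function(x, keep_metadata = FALSE) {
  if (!is.numeric(unclass(x))) {
    stop("The underlying data is not numeric.", call. = FALSE)
  }
  out <- as.numeric(unclass(x))  # as.numeric() drops all attributes
  if (keep_metadata) {
    attr(out, "label") <- attr(x, "label", exact = TRUE)
    attr(out, "unit")  <- attr(x, "unit", exact = TRUE)
  }
  out
}

gdp_toy <- structure(c(2523.6, 2725.8), label = "GDP", unit = "CP_MEUR")
toy_as_numeric(gdp_toy)                        # plain numeric vector
toy_as_numeric(gdp_toy, keep_metadata = TRUE)  # label and unit retained

# Character data is rejected instead of being silently converted.
msg <- tryCatch(
  toy_as_numeric(c("AD", "LI")),
  error = function(e) conditionMessage(e)
)
msg
```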

Logical coercion

as_logical() converts defined vectors whose underlying data is logical (TRUE/FALSE).

Logical defined vectors cannot have value labels, ensuring consistent behaviour.

flag <- defined(c(TRUE, FALSE, TRUE), label = "Example flag")
as_logical(flag)

Date coercion

as_Date() converts a defined vector that inherits from Date back into a standard R Date vector.

Metadata is removed unless requested.

dates <- defined(
  as.Date(c("2020-01-01", "2020-01-02")),
  label = "Reference date"
)
as.Date(dates)

POSIXct coercion

as_POSIXct() converts a POSIXct-based defined vector back into a base R POSIXct object.

Time zones and the underlying numeric representation are always preserved.

times <- defined(
  as.POSIXct(c("2020-01-01 12:00:00", "2020-01-01 18:00:00")),
  label = "Timestamp"
)

times 

These coercion helpers ensure that defined vectors behave predictably in modelling, exporting, and data cleaning workflows — while still preserving semantic metadata when necessary.

Conclusion

The defined() function provides a lightweight yet powerful way to make vectors self-descriptive by attaching semantic metadata directly to them. By combining a variable label, unit of measurement, concept definition, and optional namespace, defined ensures that each vector's meaning is explicit, consistent, and machine-readable.

Because the metadata is embedded at creation time, it travels with the vector throughout your workflow — whether you are analysing, transforming, or exporting data.
This prevents context loss, supports the FAIR data principles (Findable, Accessible, Interoperable, Reusable), and facilitates integration with semantic web technologies.

defined vectors work seamlessly with the dataset_df class to create semantically enriched data frames where both datasets and their constituent variables carry rich, standardised metadata.
For more on creating semantically enriched datasets, see the dataset_df vignette.

For guidance on recording bibliographic metadata and citations, see the bibrecord vignette.



dataset documentation built on Nov. 16, 2025, 5:06 p.m.