# classify_occupation: Classify occupations

In labourR: Classify Multilingual Labour Market Free-Text to Standardized Hierarchical Occupations

## Description

This function takes advantage of the hierarchical structure of the ESCO-ISCO mapping and matches multilingual free-text with the ESCO occupations vocabulary in order to map semi-structured vacancy data into the official ESCO-ISCO classification.

## Usage

    classify_occupation(
      corpus,
      id_col = "id",
      text_col = "text",
      lang = "en",
      num_leaves = 10,
      isco_level = 3,
      max_dist = 0.1,
      string_dist = NULL
    )

## Arguments

| Argument | Description |
| --- | --- |
| `corpus` | A data.frame or a data.table that contains the id and the text variables. |
| `id_col` | The name of the id variable. |
| `text_col` | The name of the text variable. |
| `lang` | The language that the text is in. |
| `num_leaves` | The number of occupations/neighbors that are kept when matching. |
| `isco_level` | The ISCO level of the suggested occupations. Can be 1, 2, 3 or 4 for ISCO occupations, or NULL to return ESCO occupations. |
| `max_dist` | Maximum string distance used for fuzzy matching. The amatch function from the stringdist package is used. |
| `string_dist` | String dissimilarity measurement. Available string distance metrics: stringdist-metrics. |

## Details

First, the input text is cleansed and tokenized. The tokens are then matched against the ESCO occupations vocabulary, which is built from the preferred and alternative labels of the occupations. The matched tokens are joined with the tf-idf weighted tokens of the ESCO occupations, and the sum of the tf-idf scores per occupation is used to retrieve the suggested occupations. Formally, the suggested ESCO occupations are retrieved by solving the optimization problem,

$$\arg\max_d \left\{ \vec{u}_{binary} \cdot \vec{u}_d \right\}$$

where $\vec{u}_{binary}$ is the binary representation of a query in the ESCO vocabulary space and $\vec{u}_d$ is the normalized vector of ESCO occupation $d$, generated by the tf-idf numerical statistic. If an ISCO level is specified, the k-nearest neighbors algorithm is used to determine the suggested occupation: it is classified by a plurality vote of its neighbors at the corresponding hierarchical level.
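The retrieval and voting steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helpers `tokenize` and `suggest_isco`, the weight matrix `U`, and the `isco` code vector are all hypothetical, the weights are made up rather than computed from ESCO labels, and ties in the vote are simply resolved in favour of the first group.

```r
# Simplified stand-in for the cleansing/tokenizing step (hypothetical helper).
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

# Made-up tf-idf weights: one row per occupation, one column per vocabulary token.
U <- matrix(
  c(0.8, 0.3, 0.0, 0.0,
    0.7, 0.0, 0.0, 0.0,
    0.0, 0.4, 0.9, 0.0,
    0.0, 0.0, 0.0, 1.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("building_architect", "landscape_architect", "civil_engineer", "cashier"),
    c("architect", "engineer", "civil", "cashier")
  )
)

# Illustrative 4-digit ISCO codes for the toy occupations above.
isco <- c(
  building_architect  = "2161",
  landscape_architect = "2162",
  civil_engineer      = "2142",
  cashier             = "5230"
)

# Hypothetical helper: score, keep the top neighbours, vote on the ISCO group.
suggest_isco <- function(text, U, isco, num_leaves = 3, isco_level = 3) {
  # Binary query vector over the vocabulary: 1 if the token occurs in the text.
  u_binary <- as.numeric(colnames(U) %in% tokenize(text))

  # Dot product of the query with every occupation vector.
  scores <- drop(U %*% u_binary)

  # Keep the num_leaves highest-scoring occupations as neighbours.
  neighbours <- names(sort(scores, decreasing = TRUE))[seq_len(num_leaves)]

  # Truncate each neighbour's ISCO code to the requested level and take the
  # plurality vote (which.max picks the first group on a tie).
  groups <- substr(isco[neighbours], 1, isco_level)
  votes <- table(groups)
  names(votes)[which.max(votes)]
}

suggest_isco("Junior Architect Engineer", U, isco)
```

With these toy weights the three nearest neighbours are the two architect entries and the civil engineer, so the level-3 vote returns the group "216".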

Before the suggestions are returned, the preferred label of each suggested occupation is added to the result, using the occupations_bundle and isco_occupations_bundle as look-up tables.
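The look-up step can be pictured as a plain key match. The sketch below assumes hypothetical column names (`iscoGroup`, `preferredLabel`) and a tiny hand-written table in place of the real bundles, so it only illustrates the idea of attaching labels to suggestions.

```r
# Hypothetical suggestions, as they might look before labels are attached.
suggestions <- data.frame(
  id = c(1, 2),
  iscoGroup = c("216", "523"),
  stringsAsFactors = FALSE
)

# Tiny stand-in for a look-up table (the real bundles ship with the package).
isco_lookup <- data.frame(
  iscoGroup = c("216", "523", "341"),
  preferredLabel = c(
    "Architects, planners, surveyors and designers",
    "Cashiers and ticket clerks",
    "Legal, social and religious associate professionals"
  ),
  stringsAsFactors = FALSE
)

# Attach the preferred label by matching on the group code.
idx <- match(suggestions$iscoGroup, isco_lookup$iscoGroup)
suggestions$preferredLabel <- isco_lookup$preferredLabel[idx]
suggestions
```

Using `match` keeps the row order of the suggestions and yields `NA` for any code missing from the look-up table, which makes unmatched groups easy to spot.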

## Value

Either a data.table with the id, the preferred label and the suggested ESCO occupation URIs (num_leaves predictions for each id), or a data.table with the id, the preferred label and the suggested ISCO group at the requested level (one for each id).

## References

M.P.J. van der Loo (2014). The stringdist package for approximate string matching. R Journal 6(1) pp 111-122.

Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M., & Steiner, S. (2017). Three Methods for Occupation Coding Based on Statistical Learning, Journal of Official Statistics, 33(1), 101-122.

Arthur Turrell, Bradley J. Speigner, Jyldyz Djumalieva, David Copple, James Thurgood (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings.

ESCO Service Platform - The ESCO Data Model documentation

## Examples

    # Three free-text job titles to classify
    corpus <- data.frame(
      id = 1:3,
      text = c(
        "Junior Architect Engineer",
        "Cashier at McDonald's",
        "Priest at St. Martin Catholic Church"
      )
    )

    # Suggest one ISCO level-3 group per id, voting over 5 neighbours
    classify_occupation(corpus = corpus, isco_level = 3, lang = "en", num_leaves = 5)

labourR documentation built on July 18, 2020, 5:06 p.m.