In meztez/nlpraqc19: R NLP R@Quebec 2019

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, eval = TRUE)

Survol

L'objectif de cet atelier et de vous présenter différentes méthodes pour intégrer des documents dans un contexte d'apprentissage machine. Vous devrez ensuite appliquer ces méthodes pour entraîner un modèle de classification binaire.

À la base, un document doit être transformé en une représentation numérique pour pouvoir être utilisé dans un algorithme d'entraînement.

La technique utilisée pour effectuer cette transformation a un grand impact sur les résultats. C'est ce que nous explorerons aujourd'hui.

Agenda

Base
- Comment importer des documents (tm, rvest, odbc, jsonlite)
- Un petit mot sur l'encodage
- Manipuler des chaînes de caractères (grep it like its hot)
Transformations
- Orthographe (hunspell)
- Stopwords (tm)
- Stemming (SnowballC)
- Collocations
- Contextes négatifs
- Matrice de fréquence des termes dans un documents
Apprentissage machine

Atelier

Base

Comment importer des documents

À partir d'une source de données odbc (odbc)

library(odbc)
docs <- dbGetQuery(dbConnect(odbc(), "DSN"), "SELECT TEXT FROM COMMENTS")

À partir d'internet (rvest)

library(rvest)
content <- read_html('https://old.reddit.com/r/Quebec/')
submissions <- content %>% html_nodes('a.title') %>% html_text
links <- content %>% html_nodes('a.title') %>% html_attr("href")
rqclatest <- data.frame("submissions" = submissions, "links" = links)
rqclatest

Référence :
* https://datascienceplus.com/building-a-hacker-news-scraper-with-8-lines-of-r-code-using-rvest-library/

À partir de documents pdf ou d'images (tesseract, tm)

library(tesseract)
vignette("intro", "tesseract")

Références :
https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/ https://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e

À partir d'un archive de fichiers json

library(jsonlite)

archive <- "../extract_json_r_20190412.zip"
unzip(archive, overwrite = TRUE, junkpaths = TRUE, exdir = "../json")
files <- dir("../json", full.names = TRUE, recursive = FALSE)
docs <- sapply(files,
                function(file) {
                  content <- readLines(file, encoding = "latin1", warn = FALSE)
                  content <- iconv(content, from = "ISO8859-1", to = "UTF-8")
                  # content <- gsub("\\\\", "\\\\u005C", content)
                  # content <- gsub("\u0008", "\\\\u0008", content)
                  # content <- gsub("\u0009", "\\\\u0009", content)
                  # content <- gsub("\u000A", "\\\\u000A", content)
                  # content <- gsub("\u000C", "\\\\u000C", content)
                  # content <- gsub("\u000D", "\\\\u000D", content)
                  decoded <- try(fromJSON(content))
                  if (class(decoded) == "try-error") {
                    print(file)
                    return()
                  } else {
                    return(decoded$text-content)
                  }
                })

À partir d'un data déjà créé pour l'atelier d'aujourd'hui

library(data.table)

utils::data(collisions, package = "nlpraqc19")
docs <- nlpraqc19::collisions
docs[sample(1:nrow(docs), 5)]

Un petit mot sur l'encodage

Si vous voyez apparaître des symboles � inattendus dans vos chaînes de caractères, c'est probablement parce qu'il y a un problème avec l'encodage.

Pour faire une histoire courte, vos programmes utilisent une table d'encodage pour déterminer la correspondance entre la représentation binaire et la représentation symbolique des caractères.

Donc pour s'assurer d'avoir la bonne représentation, il faut effectuer la lecture ou l'écriture des données avec le bon encodage.

# Définir l'encodage au niveau d'une connexion
dbConnect(odbc(), encoding = "latin1")

# Convertir l'encodage d'une chaîne de caractères
iconv(text, from = "ISO8859-1", to = "UTF-8")

Référence :
https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/

Manipuler des chaînes de caractères (grep it like its hot)

Il y a quelques outils de base qui sont indispensables à la manipulation de chaînes de caractères.

grep, gsub et expressions régulières

Trouver les 10 premiers commentaires qui contiennent le mot "chat".

grep("chat", collisions$QUEST_CICH_COMNT)[1:3]

Ça nous donne l'indice mais si on veut la valeur.

grep("chat", collisions$QUEST_CICH_COMNT, value = TRUE)[1:3]

C'est pas tout à fait ça, on veut le mot chat donc il faut mieux définir notre expression de recherche. Ajoutons l'option d'ignorer minuscule masjucule.

grep("\\bchat\\b", collisions$QUEST_CICH_COMNT, value = TRUE, ignore.case = TRUE)[1:3]

Les expressions régulières sont très puissantes. Elles permettent de trouver rapidement de l'information. Maîtriser les et dominer les mondes... ou presque.

?regex

Vous pouvez expérimenter avec différentes expressions pour identifier des cas. C'est comme un où est Charlie.

Trouver quelqu'un qui est rentré dans un arbre.

txt <- collisions$QUEST_CICH_COMNT
grep("\\brentr", txt[grep("\\barbre\\b",txt)], value = TRUE, ignore.case = TRUE)[11]

Bonjour la police

grep("voiture de police|char de police", txt, value = TRUE, ignore.case = TRUE)[1:3]

Est-ce que je peux avoir votre code postal?

grep("[a-z][0-9][a-z][ -]?[[:digit:]][[:alpha:]][[:digit:]]", txt, value = TRUE, ignore.case = TRUE)[1:3]

La fonction gsub sert à effectuer des remplacements. Elle pourrait servir à retirer les informations sensibles des champs textes comme les numéros de téléphones.

phonepattern <- "\\b\\(?([0-9]{3})\\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})\\b"
findtel <- grep(phonepattern, txt, value = TRUE, ignore.case = TRUE)[1]
data.frame("origine" = findtel, "remplacement" = gsub(phonepattern, "", findtel))

Les fonctions gregexpr et regmatches permettent d'extraire les valeurs directement des chaînes de caractères.

parsed <- gregexpr(phonepattern, txt)
matches <- regmatches(txt, parsed)
unlist(unlist(matches))[1:3]

Autres fonctions utiles

Les fonctions substr, strsplit, paste, paste0, tolower, toupper sont également très pratiques.

phrase <- sample(txt, 1)
substr(phrase, max(1, nchar(phrase) - 10), nchar(phrase))

strsplit(phrase, "[^-'[:alnum:]]+")

paste("sors avec", unlist(strsplit(phrase, "[^[:alnum:]]+")))

paste0(21:31, collapse = ",")

toupper(phrase)

Transformations

Orthographes

Le package hunspell permet de valider l'orthographe de textes.

Charger hunspell

library(hunspell)

Télécharger le dictionnaire français le plus à jour.

tempf <- tempfile()
tempd <- tempdir()
download.file("http://grammalecte.net/download/fr/hunspell-french-dictionaries-v6.4.1.zip", tempf)
dicts <- grep("tes\\.aff?|tes\\.dic?", unzip(tempf, list = TRUE)$Name, value = TRUE)
unzip(tempf, files = dicts, overwrite = TRUE, junkpaths = TRUE, exdir = tempd)

custom_words <- c("b2","b3","tp","vh","boul","v\U00e9h","faq","vt","faq20",
                 "veh","mtl","43a","qc","rdp","dir","43ae","cond","domms",
                 "20a","iga","aut","ins","ste","st","blv","domm","pcq","bags",
                 "43e","rp","coord","faq43a","berpa","13c","faq27","str")
dict_fr <- dictionary(paste0(tempd, "\\", grep("\\.dic", dicts, value = TRUE)), add_words = custom_words)
dict_en <- dictionary("en_CA", add_words = custom_words)

Maintenant regardons des phrases dans nos données.

phrases <- sample(txt, 5)
words <- hunspell_parse(phrases, dict = dict_fr)
correct <- lapply(words, hunspell_check, dict_fr)
sapply(correct, all)

Trouvez les mots mal orthographiés ou non reconnus.

bad <- lapply(1:length(words),function(x) {words[[x]][!correct[[x]]]})
bad

On peut aussi passer par hunspell directement.

hunspell(phrases, dict = dict_fr)

Qu'est-ce que le dictionaire propose?

lapply(bad, function(x) {
  suggest <- hunspell_suggest(x, dict = dict_fr)
  names(suggest) <- x
  suggest
  })

Tentative d'autocorrection. Si plus de 50% des mots sont mal orthographiés, on va assumer que c'est de l'anglais. Pas efficace mais honnête.

autocorrect <- function(txt, dict, alt_dict) {
  parsed <- hunspell_parse(txt, dict = dict)
  checked <- lapply(parsed, hunspell_check, dict = dict)
  suggested <- lapply(1:length(parsed), function(i) {
    if (!all(checked[[i]])) {
      if (sum(!checked[[i]])/length(checked[[i]]) > 0.5) {
        checked[[i]] <- hunspell_check(parsed[[i]], alt_dict)
        dict_suggest <- alt_dict
      } else {
        dict_suggest <- dict
      }
      suggested <- unlist(lapply(hunspell_suggest(parsed[[i]][!checked[[i]]], dict = dict_suggest), `[`, 1))
      parsed[[i]][!checked[[i]]] <- ifelse(is.na(suggested), parsed[[i]][!checked[[i]]], suggested)
    }
    parsed[[i]]
  })
  return(unlist(lapply(suggested, paste, collapse = " ")))
}

On essaie.

autocorrect(phrases, dict_fr, dict_en)
autocorrect("hi, grandma? can u come pyck me up from my rap batttle? it's over. no, i lost. he saw u droop me off & did a prety devastating rhyme about it", dict_fr, dict_en)

La réalité de la vie c'est que les données ne sont pas toujours parfaite et que la correction de mots sans contexte c'est pas facile. Il n'y a pas encore d'intégration de grammalecte avec R, mais ça existe en python via reticulate. À explorer.

Racine des mots Optionel

Avec SnowballC. Des fois c'est utile pour réduire la taille d'un vocabulaire.

library(SnowballC)
lapply(strsplit(phrases, "[^-'[:alnum:]]+"), wordStem, language = "french")

Présentation du package text2vec

La vectorisation

On reprend les guides de http://text2vec.org mais avec nos données. Tout le crédit va à Dmitriy Selivanov, envoyez lui du love.

Le setup.

library(text2vec)
set.seed(3.1416)
model_data <- collisions[, .(QUEST_CICH_COMNT)]
model_data[, target := ifelse(collisions$QUEST_COLLISION_PERTETOT == "O", 1, 0)]
model_data[, id := .I]
all_ids <- model_data$id
train_ids <- sample(all_ids, 80000)
test_ids <- setdiff(all_ids, train_ids)
train <- model_data[train_ids]
test <- model_data[test_ids]

Représentation des documents dans l'espace vectoriel avec le vocabulaire. Pour l'instant on va utiliser un preprocessor de base.

prep_fun <- tolower
tok_fun <- word_tokenizer

it_train <- itoken(train$QUEST_CICH_COMNT, 
             preprocessor = prep_fun, 
             tokenizer = tok_fun, 
             ids = train$id, 
             progressbar = FALSE)
vocab <- create_vocabulary(it_train)

Le vocabulaire c'est juste un data.frame avec les mots et des compteurs.

head(vocab)

La matrice de fréquence des termes dans des documents.

vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)
dim(dtm_train)
identical(rownames(dtm_train), as.character(train$id))

Le premier modèle. Optionel

library(glmnet)
glmnet_classifier <- cv.glmnet(x = dtm_train, y = train[["target"]], 
                              family = 'binomial', 
                              # L1 penalty
                              alpha = 1,
                              # interested in the area under ROC curve
                              type.measure = "auc",
                              # n-fold cross-validation
                              nfolds = 4,
                              # high value is less accurate, but has faster training
                              thresh = 1e-3,
                              # again lower number of iterations for faster training
                              maxit = 1e3)

Il était un fois un graphique.

plot(glmnet_classifier)

Mais c'est un AUC!

print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

On peut reprendre le même vectorizer pour regarder la performance sur la validation. Je veux pas voler de punch mais c'est comme GoT 8. Va falloir travailler plus fort!

it_test <- itoken(test$QUEST_CICH_COMNT, 
             preprocessor = prep_fun, 
             tokenizer = tok_fun, 
             ids = test$id, 
             progressbar = FALSE)

dtm_test = create_dtm(it_test, vectorizer)

preds <- predict(glmnet_classifier, dtm_test, type = 'response')[,1]
glmnet:::auc(test$target, preds)

Réduire la taille du vocabulaire.

Quand on enlève les stopwords, on perds les contextes négatifs.

swe <- tm::stopwords("en")
swf <- tm::stopwords("fr")
stop_words <- c(swe, swf)
vocab <- create_vocabulary(it_train, stopwords = stop_words)

pruned_vocab <- prune_vocabulary(vocab, 
                                 term_count_min = 50, 
                                 doc_proportion_max = 0.95)
vectorizer <- vocab_vectorizer(pruned_vocab)
dtm_train  <- create_dtm(it_train, vectorizer)

dim(dtm_train)

dtm_test <- create_dtm(it_test, vectorizer)
dim(dtm_test)

Les n-grams ou comment fusionner des mots.

vocab <- create_vocabulary(it_train, ngram = c(1L, 2L), stopwords = stop_words)

vocab <- prune_vocabulary(vocab,
                          term_count_min = 50, 
                          doc_proportion_max = 0.95)

bigram_vectorizer <- vocab_vectorizer(vocab)

dtm_train <- create_dtm(it_train, bigram_vectorizer)

glmnet_classifier <- cv.glmnet(x = dtm_train, y = train[['target']], 
                 family = 'binomial', 
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = 4,
                 thresh = 1e-3,
                 maxit = 1e3)

plot(glmnet_classifier)

print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

Dans notre cas, c'est pire. On ne reconnaît pas la différence entre J'ai rentré dedans et J'ai pas rentré dedans.

dtm_test <- create_dtm(it_test, bigram_vectorizer)
preds <- predict(glmnet_classifier, dtm_test, type = 'response')[,1]
glmnet:::auc(test$target, preds)

Hash hash baby

h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 14, ngram = c(1L, 2L))
dtm_train <- create_dtm(it_train, h_vectorizer)

glmnet_classifier <- cv.glmnet(x = dtm_train, y = train[['target']], 
                             family = 'binomial', 
                             alpha = 1,
                             type.measure = "auc",
                             nfolds = 5,
                             thresh = 1e-3,
                             maxit = 1e3)

plot(glmnet_classifier)

print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

dtm_test <- create_dtm(it_test, h_vectorizer)
preds <- predict(glmnet_classifier, dtm_test , type = 'response')[, 1]
glmnet:::auc(test$target, preds)

Transformation de base

Normalisation

dtm_train_l1_norm <- normalize(dtm_train, "l1")

TF-IDF

vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)

# define tfidf model
tfidf <- TfIdf$new()
# fit model to train data and transform train data with fitted model
dtm_train_tfidf <- fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!
# apply pre-trained tf-idf transformation to test data
dtm_test_tfidf <- tfidf$transform(create_dtm(it_test, vectorizer))

glmnet_classifier <- cv.glmnet(x = dtm_train_tfidf, y = train[['target']], 
                              family = 'binomial', 
                              alpha = 1,
                              type.measure = "auc",
                              nfolds = 4,
                              thresh = 1e-3,
                              maxit = 1e3)

plot(glmnet_classifier)

print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

preds <- predict(glmnet_classifier, dtm_test_tfidf, type = 'response')[,1]
glmnet:::auc(test$target, preds)

GloVe embeddings

itk <- itoken(collisions$QUEST_CICH_COMNT, 
              preprocessor = tolower, 
              tokenizer = tok_fun, 
              ids = collisions[, id := .I]$id, 
              progressbar = FALSE)
vocab <- create_vocabulary(itk)

vocab <- prune_vocabulary(vocab, term_count_min = 10)

vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(itk, vectorizer, skip_grams_window = 7L)

Fits like a GloVe

glove <- GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 50)
wv_main <- glove$fit_transform(tcm, n_iter = 25, convergence_tol = 0.001)

dim(wv_main)

wv_context <- glove$components
dim(wv_context)

word_vectors <- wv_main + t(wv_context)

stationnement <- word_vectors["stationnement", , drop = FALSE]
cos_sim <- sim2(x = word_vectors, y = stationnement, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 50)

Collocations

model <- Collocations$new(collocation_count_min = 25)
model$fit(itk, n_iter = 3)

model$collocation_stat

On peut utliser les collocations pour identifier les contextes négatifs en créant un vocabulaire spécifique.

colloc <- Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, sep = "_")
?Collocations

clean_text <- function(txt) {
  txt <- tolower(txt)
  # identifier contexte negatif par remplacement par un seul identifiant
  neg <- c(
    "aren't", "aucn", "aucun", "aucune", "can't", "cannot", "couldn't", "didn't",
    "doesn't", "don't", "hadn't", "hasn't", "haven't", "isn't", "mustn't", "ni",
    "no", "non", "none", "not", "pas", "persone de", "sans", "shan't", "shouldn't",
    "wasn't", "weren't", "won't", "wouldn't"
  )
  txt <- gsub(paste0("\\b", neg, "\\b", collapse = "|"), "negctxt", txt)
  # enlever les mots poches, les stopwords
  swe <- tm::stopwords("en")
  swf <- tm::stopwords("fr")
  txt <- tm::removeWords(txt, c(swe, swf))
  # enlever les accents avec le code unicode pour etre platform independant
  o1 <- "\U0073\U007a\U00fe\U00e0\U00e1\U00e2\U00e3\U00e4\U00e5\U00e7\U00e8\U00e9\U00ea\U00eb\U00ec\U00ed\U00ee\U00ef\U00f0\U00f1\U00f2\U00f3\U00f4\U00f5\U00f6\U00f9\U00fa\U00fb\U00fc\U00fd"
  n1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
  txt <- chartr(o1, n1, txt)
  # enlever ce qui est pas a-z
  txt <- gsub("[[:punct:]]|[^[:graph:]]|[0-9]+", " ", txt)
  txt <- gsub("[ ]+", " ", txt)
  return(txt)
}

it <- itoken(model_data$QUEST_CICH_COMNT,
             preprocessor = clean_text,
             tokenizer = word_tokenizer,
             ids = 1:nrow(collisions),
             progressbar = FALSE)

colloc <- Collocations$new(collocation_count_min = 10, pmi_min = 5, gensim_min = 100, lfmd_min = -25, llr_min = 1000)
colloc$fit(it, n_iter = 1)

# Construire les phrases avec la collocation
it_wc <- colloc$transform(it)
vocab_wc <- create_vocabulary(it_wc)
vocab_wc <- prune_vocabulary(vocab_wc, term_count_min = 50)
vtz <- vocab_vectorizer(vocab_wc)

# Utiliser la frequence relative des termes pour selectionner les mots les plus discriminants
dtm <- create_dtm(it_wc, vtz)
rfo <- Matrix::colSums(dtm[model_data$target == 1, ]) / sum(model_data$target == 1)
rfn <- pmax(Matrix::colSums(dtm[model_data$target == 0, ]), 1) / sum(model_data$target == 0)
rf <- sort(rfo / rfn, decreasing = TRUE)
cutoff <- 1.5
words <- names(rf)[rf >= cutoff | rf <= 1 / cutoff]
c(head(words, 10), tail(words, 10))

On rajoute synthétiquement des collocations négatives

# Reconstruire le vocabulaire et check les termes neg qui existe dans le corpus pour creer le vocab neg
bvocab <- create_vocabulary(itoken(words))
bvocabneg <- create_vocabulary(it_wc, ngram = c(2, 2))
bvocabneg <- bvocabneg[grep("n\U00e9gative", bvocabneg$term), ]
bvocabneg <- bvocabneg[bvocabneg$term %in% paste0("n\U00e9gative_", words), ]

# Vecteurs de mots normaux et négatifs
wordsvec <- sort(c(bvocab$term, bvocabneg$term))

# Vocabulaire de base avec le vectorizer et les termes négatifs
bvc <- create_vocabulary(itoken(wordsvec, progressbar = FALSE))
bvc <- bvc[order(bvc$term), ]
bvtz <- vocab_vectorizer(bvc)

# Synthetic colloc to take care of neg words
txt <- gsub("_", " ", bvc$term)
itsynt <- itoken(txt, progressbar = FALSE)
syntcolloc <- Collocations$new(collocation_count_min = 1, pmi_min = -Inf, gensim_min = 0)
syntcolloc$fit(itsynt, n_iter = 2)

it_train <- itoken(train$QUEST_CICH_COMNT,
                   preprocessor = clean_text,
                   tokenizer = word_tokenizer,
                   progressbar = FALSE)
modmtrx_train <- create_dtm(syntcolloc$transform(it_train), bvtz)
it_test <- itoken(test$QUEST_CICH_COMNT,
                  preprocessor = clean_text,
                  tokenizer = word_tokenizer,
                  progressbar = FALSE)
modmtrx_test <- create_dtm(syntcolloc$transform(it_test), bvtz)

Modélisation de sujets

LSA

tfidf <- TfIdf$new()
lsa <- LSA$new(n_topics = 4)

doc_embeddings <- lsa$fit_transform(tfidf$fit_transform(modmtrx_train))

dim(doc_embeddings)

dim(lsa$components)

new_doc_embeddings <- lsa$transform(tfidf$transform(modmtrx_test))
dim(new_doc_embeddings)

LDA

dtm <- create_dtm(syntcolloc$transform(it_train), bvtz, type = "dgTMatrix")

lda_model <- LDA$new(n_topics = 4, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = dtm, n_iter = 1000, convergence_tol = 0.001, n_check_convergence = 25, progressbar = FALSE)

barplot(doc_topic_distr[1, ], xlab = "topic", ylab = "proportion", ylim = c(0, 1), names.arg = 1:ncol(doc_topic_distr))

Dans notre cas, le sujet est assez ciblé.

lda_model$get_top_words(n = 10, topic_number = c(1L, 2L, 3L, 4L), lambda = 1)

lda_model$get_top_words(n = 10, topic_number = c(1L, 2L, 3L, 4L), lambda = 0.2)

new_dtm <- create_dtm(syntcolloc$transform(it_test), bvtz, type = "dgTMatrix")
new_doc_topic_distr <- lda_model$transform(new_dtm)

perplexity(new_dtm, topic_word_distribution = lda_model$topic_word_distribution, doc_topic_distribution = new_doc_topic_distr)

Besoin du package LDAvis donc optionel

lda_model$plot()

Apprentissage machine

Définir les indices d'entraînement et de validation.

n <- nrow(collisions)
set.seed(8675309)

Charger la librairie de boosting de votre choix (xgboost, lightgbm, catboost).

library(xgboost)

Préparer les matrices pour l'entraînement en réutilisant notre matrice maison.

mtrx_train <- xgb.DMatrix(modmtrx_train, label = train$target)
mtrx_test <- xgb.DMatrix(modmtrx_test, label = test$target)

Entraîner le modèle.

set.seed(8675309)
xgbmodel <- xgb.train(data = mtrx_train,
                      watchlist = list(eval = mtrx_test, train = mtrx_train),
                      nrounds = 200,
                      objective = "binary:logistic",
                      booster = "gbtree",
                      early_stopping_rounds = 50,
                      print_every_n = 50,
                      max_depth = 15,
                      gamma = 5,
                      subsample = 0.5,
                      colsample_bytree = 1,
                      eta = 0.1)

Vérifier les performances.

preds <- predict(xgbmodel, mtrx_test)
obs <- test$target

recall <- function(p) {
  sapply(p, function(p) {
    prd <- as.numeric(preds > p)
    length(which(prd == 1 & obs == 1)) / length(which(obs == 1))
  })
}

precision <- function(p) {
  sapply(p, function(p) {
    prd <- as.numeric(preds > p)
    length(which(prd == 1 & obs == 1)) / length(which(prd == 1))
  })
}

x <- recall((1:250) / 250)
y <- precision((1:250) / 250)

plot(x, y, main = "Validation", xlab = "Recall", ylab = "Precision")

meztez/nlpraqc19 documentation built on June 4, 2019, 5:20 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

meztez/nlpraqc19
R NLP R@Quebec 2019

In meztez/nlpraqc19: R NLP R@Quebec 2019

Survol

Agenda

Atelier

Base

Comment importer des documents

À partir d'une source de données odbc (odbc)

À partir d'internet (rvest)

À partir de documents pdf ou d'images (tesseract, tm)

À partir d'un archive de fichiers json

À partir d'un data déjà créé pour l'atelier d'aujourd'hui

Un petit mot sur l'encodage

Manipuler des chaînes de caractères (grep it like its hot)

grep, gsub et expressions régulières

Autres fonctions utiles

Transformations

Orthographes

Racine des mots Optionel

Présentation du package text2vec

La vectorisation

GloVe embeddings

Collocations

Modélisation de sujets

Apprentissage machine

R Package Documentation

Browse R Packages

We want your feedback!

meztez/nlpraqc19 R NLP R@Quebec 2019

In meztez/nlpraqc19: R NLP R@Quebec 2019

Survol

Agenda

Atelier

Base

Comment importer des documents

À partir d'une source de données odbc (odbc)

À partir d'internet (rvest)

À partir de documents pdf ou d'images (tesseract, tm)

À partir d'un archive de fichiers json

À partir d'un data déjà créé pour l'atelier d'aujourd'hui

Un petit mot sur l'encodage

Manipuler des chaînes de caractères (grep it like its hot)

grep, gsub et expressions régulières

Autres fonctions utiles

Transformations

Orthographes

Racine des mots Optionel

Présentation du package text2vec

La vectorisation

GloVe embeddings

Collocations

Modélisation de sujets

Apprentissage machine

R Package Documentation

Browse R Packages

We want your feedback!

meztez/nlpraqc19
R NLP R@Quebec 2019