knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 )
HandwriterRF has a pre-trained random forest and set of reference similarity scores that are the default for compare_documents()
and compare_writer_profiles()
. This tutorial shows you how to train your own random forest and create your own set of reference scores to use with these functions.
You need scanned handwriting samples saved as PNG images for training the random forest and making reference scores. The training set must include at least two samples from each writer so that the random forest can see examples of documents written by the same writer and examples of documents written by different writers.
The CSAFE Handwriting Database contains suitable handwriting samples that you may download for free if you don't have your own samples.
Place handwriting samples that you will use to train a random forest in a folder. The first step is to estimate a writer profile from each handwriting sample. We do this with handwriter::get_writer_profiles()
. Behind the scenes, handwriter::get_writer_profiles()
performs the following steps for each sample:
handwriter::processDocument()
.handwriter::make_clustering_template()
. By default, handwriter::get_writer_profiles()
uses the cluster template templateK40
included with handwriter. You may create your own cluster template if you prefer.handwriter::get_cluster_fill_rates()
. The cluster fill rates serve as an estimate of a writer profile for the writer of the document.Load handwriter and handwriterRF.
library(handwriter) library(handwriterRF)
Calculate writer profiles for the training samples with templateK40
. The output is a dataframe.
profiles <- handwriter::get_writer_profiles( input_dir = "path/to/training/samples/folder", measure = "rates", num_cores = 1, template = handwriter::templateK40, output_dir = "path/to/output/folder" )
Now that we have writer profiles, we can train a random forest. train_rf()
performs the following steps:
?train_rf
for more information about these measures.When running train_rf()
you have a several choices to make:
ntrees = 200
produced good results.output_dir
argument, the random forest will be returned but not saved to your computer.downsample_diff_pairs = TRUE
. This randomly samples the different writers distances to equal the number of same writer distances.rf <- train_rf( df = profiles, ntrees = 200, distance_measures = c("abs", "man", "euc", "max", "cos"), output_dir = "path/to/output/folder", downsample_diff_pairs = TRUE )
If you would like to train a series of random forests with lapply
or a for loop, use the run number and output directory arguments. The run number is added to the file name when the random forest is saved, so that subsequent random forests are not saved over the previous ones.
for (i in 1:10) { rf <- train_rf( df = profiles, ntrees = 200, distance_measures = c("abs", "man"), output_dir = "path/to/output/folder", run_number = i, downsample_diff_pairs = TRUE ) }
The functions compare_documents()
and compare_writer_profiles()
either return a similarity score or a score-based likelihood. Both express how similar or not two handwriting samples are to each other.
The score-based likelihood ratio (SLR) builds upon the observed similarity score by comparing it to reference same writer and different writers similarity scores. The SLR is the ratio of the likelihood of observing the similarity score if the samples where written by the same writer to the likelihood of observing the similarity score if the samples where written by the different writers.
If compare_documents()
and compare_writer_profiles()
only return the similarity score, reference scores are not used. But if these functions calculate an SLR they need reference scores. HandwriterRF includes a set of reference score as ref_scores
for use with these functions, but you can also create your own set of reference scores.
Refer to the sections above to obtain suitable training samples and estimate writer profiles.
ref_profiles <- handwriter::get_writer_profiles( input_dir = "path/to/ref/samples/folder", measure = "rates", num_cores = 1, template = handwriter::templateK40, output_dir = "path/to/output/folder" ) rscores <- get_ref_scores(rforest = rf, df = ref_profiles)
We can plot the built-in reference scores in a way similar to a histogram. These scores range from 0 to 1, inclusive. The plot_scores()
function divides this range into bins and calculates the proportion of scores that fall into each bin. Normally, a histogram would show the count of scores in each bin. However, since there are many more different writers scores than same writer scores, the histogram for different writers scores dominates, making the same writer histogram hard to see. To fix this, we plot the proportion (rate) of scores in each bin instead of the raw frequency, which balances the two histograms and makes both more visible.
plot_scores(scores = ref_scores)
If we want to see how an observed score compares to the same writer and different writers scores, we use the obs_score
argument. For example, if the observed score is 0.2, we plot
plot_scores(scores = ref_scores, obs_score = 0.2)
You can also plot your own reference scores.
plot_scores(scores = rscores, obs_score = 0.2)
In this section, we will use the new random forest and reference scores to compare two handwritten documents. As before, the handwriting samples need to be scanned and saved as PNG files. Do not use samples or writers that were used to create the random forest or the reference scores, as this may bias the results.
First, compare the two documents with the default random forest and reference scores. As an example, we use two handwriting samples included in handwriterRF. The system.file()
function finds the location of the handwriterRF package on your computer. We use score_only = FALSE
to return an SLR.
sample1 <- system.file("extdata", "docs", "w0238_s01_pWOZ_r02.png", package = "handwriterRF") sample2 <- system.file("extdata", "docs", "w0238_s01_pWOZ_r03.png", package = "handwriterRF") df <- compare_documents( sample1, sample2, score_only = FALSE ) df
The SLR is greater than one, which means the similarity score is more like the reference same writer scores than the different writers scores. We plot the observed score with the reference scores.
plot_scores(scores = ref_scores, obs_score = df$score)
Next, compare the same documents with the new random forest and reference scores and plot the obeserved score.
df_new <- compare_documents( sample1, sample2, score_only = FALSE, rforest = rf, reference_scores = rscores ) df_new plot_scores(scores = rscores, obs_score = df_new$score)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.