An R package for cleaning text by measuring the similarity between strings. Scores range from 0 to 100, where 0 means very different and 100 means the texts match.
Feedback on this package is very much appreciated.
devtools::install_github("dessyamirudin/similaRText")
This package has two functions, described below.
text_similarity_score(
input_text,
target_text,
space = TRUE,
ignore_case = TRUE,
score = 0
)
To understand the function, use the help function: ?text_similarity_score
Sample 1
What is the similarity between "South Korea" and "south korea"? (case-insensitive, the default)
text_similarity_score("South Korea","south korea")
   input_text target_text similarity_score
1 South Korea south korea              100
Sample 2
What is the similarity between "South Korea" and ("south korea","Indonesia")? (case sensitive)
text_similarity_score("South Korea",c("south korea","Indonesia"),ignore_case = FALSE)
   input_text target_text similarity_score
1 South Korea south korea            90.91
2 South Korea   Indonesia            50.00
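The two sample outputs above are consistent with a Levenshtein edit distance normalized by the combined length of both strings: 100 * (1 - distance / (nchar(input) + nchar(target))). This is an inference from the samples only, not a confirmed description of the package internals. The sketch below uses base R's `utils::adist`. The helper name `lev_score` is hypothetical, and the sketch does not model the `space` or `score` arguments.

```r
# Hypothetical helper, NOT part of similaRText: a normalized Levenshtein
# score that happens to agree with the README's sample outputs.
lev_score <- function(input_text, target_text, ignore_case = TRUE) {
  a <- if (ignore_case) tolower(input_text) else input_text
  b <- if (ignore_case) tolower(target_text) else target_text
  # adist() returns a matrix of edit distances; flatten it to a vector
  d <- as.vector(utils::adist(a, b))
  # Normalize by the combined length of both strings, scale to 0..100
  round(100 * (1 - d / (nchar(a) + nchar(b))), 2)
}

lev_score("South Korea", "south korea")
lev_score("South Korea", c("south korea", "Indonesia"), ignore_case = FALSE)
```

With case sensitivity on, "South Korea" and "south korea" differ by two substitutions over a combined length of 22, which gives 90.91.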
text_similarity_id(
input_text,
space = FALSE,
ignore_case = TRUE,
score = 80,
eps = 0.15
)
To understand the function, use the help function:
?text_similarity_id
This function groups similar text under one ID. It is useful for assigning an ID to a person when the ID is missing from the database.
Sample 1
text_similarity_id(c("south korea","Indonesia","South Korea"))
   input_text id
1 south korea  1
2 South Korea  1
3   Indonesia  2
Sample 2
text_similarity_id(c("Budi S","Budi Susilo","Kadir"),score=70)
   input_text id
1      Budi S  1
2 Budi Susilo  1
3       Kadir  2
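To picture how threshold-based grouping can work, here is a minimal base R sketch that links any two strings whose pairwise score reaches the threshold and cuts a single-linkage tree at that level. It is an illustration only: `group_by_similarity` is a hypothetical helper, the scoring formula is an assumption, and the sketch ignores the `space` and `eps` arguments, so it may not match the output of `text_similarity_id` in every case.

```r
# Hypothetical helper, NOT the package's implementation.
group_by_similarity <- function(x, threshold = 80) {
  key <- tolower(x)                      # compare case-insensitively
  d <- utils::adist(key)                 # pairwise Levenshtein distances
  len <- nchar(key)
  # Assumed score: edit distance normalized by combined string length
  score <- 100 * (1 - d / outer(len, len, "+"))
  # Single linkage on (100 - score): strings chained together by pairs
  # scoring at or above the threshold end up in the same group
  tree <- stats::hclust(stats::as.dist(100 - score), method = "single")
  id <- stats::cutree(tree, h = 100 - threshold)
  data.frame(input_text = x, id = as.integer(id))
}

group_by_similarity(c("south korea", "Indonesia", "South Korea"))
```

Unlike the package output shown above, this sketch returns rows in input order rather than sorted by id.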
The sample data was downloaded from Kaggle (Pakistan Intellectual Capital). It contains a list of professors from Pakistan.
data("sample_data")
To understand the data, use the help function:
?sample_data