soundexBR: Soundex Encoding For Portuguese BR


Description

The soundexBR function returns a Census-like soundex code for a string, adapted to the Brazilian Portuguese sound system. Each soundex code is four characters long, a letter followed by three digits, in the form <capital letter><digit><digit><digit>. The digits are assigned to the remaining letters of the last name. The codes are, therefore, an index based on the way a surname sounds rather than the way it is spelled. This function was first designed to work alongside the RecordLinkage package. Nonetheless, soundex codes have been employed in many other settings. See Details below.

Usage

soundexBR(term)

Arguments

term

a list, a vector or a data frame with character strings.

Details

The soundex is a coded surname (last name) index based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same, but are spelled differently, like SOUZA and SOUSA, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.
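
As a rough illustration of the idea described above, the classic US Census (American) Soundex rules can be sketched in base R. This is not the soundexBR implementation: the name `soundex_sketch` is hypothetical, the letter groups are the English ones rather than any Brazilian Portuguese adaptation, and accented or non-letter characters are simply dropped, which is an assumption made to keep the sketch short.

```r
# Minimal sketch of the classic American Soundex (hypothetical helper,
# NOT the code used by soundexBR).
soundex_sketch <- function(x) {
  # English letter groups; vowels, Y, H and W carry no digit.
  codes <- c(B = "1", F = "1", P = "1", V = "1",
             C = "2", G = "2", J = "2", K = "2",
             Q = "2", S = "2", X = "2", Z = "2",
             D = "3", T = "3", L = "4",
             M = "5", N = "5", R = "6")

  encode_one <- function(word) {
    chars <- strsplit(toupper(word), "")[[1]]
    # Assumption: keep plain A-Z only (accented letters are dropped).
    chars <- chars[chars %in% LETTERS]
    if (length(chars) == 0) return(NA_character_)

    first <- chars[1]
    rest <- chars[-1]
    # H and W do not separate letters that share a code.
    rest <- rest[!rest %in% c("H", "W")]

    digits <- unname(codes[c(first, rest)])
    digits[is.na(digits)] <- "0"          # "0" marks uncoded letters

    # Collapse runs of identical codes into a single code.
    keep <- c(TRUE, digits[-1] != digits[-length(digits)])
    digits <- digits[keep]

    # Drop the first letter's own code, then the uncoded markers.
    digits <- digits[-1]
    digits <- digits[digits != "0"]

    # Pad with zeros and keep exactly three digits.
    paste0(first, substr(paste0(paste(digits, collapse = ""), "000"), 1, 3))
  }

  vapply(x, encode_one, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("Souza", "Sousa"))  # both "S200"
```

Because Z and S fall in the same letter group, the two spellings collapse to one code, which is the behaviour the SOUZA and SOUSA example relies on.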

Value

A character vector or matrix with the same dimensions as term.

Note

This function is an adaptation of the US Census soundex. See http://www.archives.gov/research/census/soundex.html

The genealogist Dick Eastman maintains a soundex calculator on his website at http://www.eogn.com/soundex/.

Author(s)

Daniel Marcelino

References

Borg, Andreas and Murat Sariyar. (2012) RecordLinkage: Record Linkage in R, R package version 0.4-1, http://CRAN.R-project.org/package=RecordLinkage.

Camargo Jr. and Coeli CM (2000) Reclink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage. Cad. Saúde Pública, 16(2), Rio de Janeiro.

Daniel Marcelino (2010) Sobre dinheiro e eleições: um estudo dos gastos de campanha para o Congresso Nacional em 2002 e 2006.

See Also

soundexES, soundexFR.

Examples

# last name with Z
first <- 'João'
last <- 'Souza'
middle <-'Santos'
soundexBR(c(first, middle, last))

# with S, instead of Z
first <- 'João'
last <- 'Sousa'
soundexBR(c(first, last))

# Miscellaneous
names <- c('João Souza', 'João Sousa', 'Joao dos Santos Souza',
'John Souza')

soundexBR(names)

names <- c('Ana Karolina Kuhnen',
'Ana Carolina Kuhnen', 'Ana Karolina',
'Dilma Vana Rousseff', 'Dilma Rousef')
  
soundexBR(names)

# Example with RecordLinkage
#Some data:
# Use '=' inside data.frame() so the columns get proper names
mydata1 <- data.frame(
  fname = c('Ricardo','Maria','Tereza','Pedro','José', 'Germano'),
  lname = c('Cunha','Andrade','Silva','Soares','Silva','Lima'),
  age = c(67,89,78,65,68,67),
  birth = c(1945,1923,1934,1947,1944,1945),
  date = c(20120907,20120703,20120301,20120805,20121004,20121209) )


mydata2 <- data.frame(
  fname = c('Maria','Lúcia','Paulo','Marcos', 'Ricardo', 'Germanio'),
  lname = c('Andrade','Silva','Soares','Pereira','Cunha','Lima'),
  age = c(67,88,78,60,68,80),
  birth = c(1945,1924,1934,1952,1944,1932),
  date = c(20121208,20121103,20120302,20120105,20121004,20121209) )

# The RecordLinkage package must be loaded first

## Not run: 
library(RecordLinkage)

# Block on columns 1, 2, 4 and on columns 1, 2; encode columns 1 and 2 with soundexBR
pairs <- compare.linkage(mydata1, mydata2,
  blockfld = list(c(1,2,4), c(1,2)),
  phonetic = c(1,2), phonfun = soundexBR, strcmp = FALSE,
  strcmpfun = jarowinkler, exclude = FALSE, identity1 = NA,
  identity2 = NA, n_match = NA, n_non_match = NA)
      
print(pairs)

editMatch(pairs)

# To access information in the object:  
weights <- epiWeights(pairs, e = 0.01, f = pairs$frequencies)
hist(weights$Wdata, plot = FALSE) # set plot = TRUE to draw the histogram
getPairs(pairs, max.weight = Inf, min.weight = -Inf)

## End(Not run)

SciencePo documentation built on May 2, 2019, 5:53 p.m.