AutoComplete: 'AutoComplete' Asks for a dataframe, a vector of column...


View source: R/Encontrar_candidatos_dataset_v1.R

Description

AutoComplete takes a data frame, a vector of column indices, and the goal column, and returns the data frame with the missing values filled in.

Usage

AutoComplete(df, goal, maxi, repetitions, trigger = 1, ratio = 0.99)

Arguments

df

A data frame containing the missing values you wish to fill

goal

The column with the missing values you wish to fill

maxi

The maximum length of the column combinations to test. For example, if 2, every single column and every possible pair of columns is tested

repetitions

A measure of error: the larger the value, the less likely you are to get the right prediction

trigger

When all possible combinations of tuples are paired, a percentage of them will appear only once. trigger rejects a column set if this percentage is higher than the given value

ratio

Rejects columns whose ratio of unique values to total values is higher than this value. Primary keys have a ratio equal to 1
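
The quantities described under maxi, trigger and ratio can be sketched in base R. This is only an illustration of one possible reading of those descriptions, not the code AutoComplete itself runs: the helper names unique_ratio, singleton_share and candidate_sets and the toy data frame are hypothetical, and the trigger reading assumed here is "share of distinct tuples that occur exactly once".

```r
# Hypothetical helpers, NOT part of cleanerR. They only illustrate the
# quantities that the maxi, trigger and ratio descriptions refer to.

# ratio: unique values divided by total values (1 for a primary key).
unique_ratio <- function(x) length(unique(x)) / length(x)

# trigger (assumed reading): share of distinct tuples seen exactly once.
singleton_share <- function(df, cols) {
  keys <- do.call(paste, c(unname(as.list(df[cols])), sep = "\r"))
  counts <- table(keys)
  mean(counts == 1)
}

# maxi: every column set of size 1..maxi drawn from the candidate columns.
candidate_sets <- function(cols, maxi) {
  sizes <- seq_len(min(maxi, length(cols)))
  unlist(
    lapply(sizes, function(k) {
      combn(length(cols), k, FUN = function(i) cols[i], simplify = FALSE)
    }),
    recursive = FALSE
  )
}

toy <- data.frame(
  id   = 1:6,
  grp  = c("a", "a", "b", "b", "c", "c"),
  flag = c(0, 1, 0, 1, 0, 1)
)

ratios <- sapply(toy, unique_ratio)          # id has ratio 1
sets   <- candidate_sets(c(2, 3), maxi = 2)  # {2}, {3}, {2, 3}
pair_share <- singleton_share(toy, c("grp", "flag"))
grp_share  <- singleton_share(toy, "grp")
```

Under these assumptions, the id column would be rejected by the default ratio = 0.99 because every value is unique, and a trigger below 1 would reject the pair (grp, flag) in this toy data because each of its tuples occurs only once.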

Examples

# The AutoComplete function does the following:
# takes a data frame and a goal column to predict,
# tests every combination of columns, limited by a length parameter,
# and uses the best set to predict, with accuracy given by the MeanAccuracy function.
# To run some experiments, first build a data frame
e=sample(1:5,1e4,replace=TRUE)
e1=sample(1:5,1e4,replace=TRUE)
e2=sample(1:5,1e4,replace=TRUE)
e=data.frame(e,e1,e2,paste(LETTERS[e],LETTERS[e1]),paste(LETTERS[e],LETTERS[e1],LETTERS[e2]))
# Now that we have a data frame, create a copy of it
ce=e
ce[sample(1:nrow(e),0.3*nrow(e)),5]=NA
# 30 percent of column 5 is now missing
# Try to recover it with AutoComplete
ce1=AutoComplete(df=ce,goal=5,maxi=3,repetitions=nrow(ce),trigger=1)
# NA_VALUES shows how many values are still missing
print(NA_VALUES(ce1))
# Count how many values were recovered incorrectly
print(sum(ce1[,5]!=e[,5]))
# The same process can be applied to column 4
ce=e
ce[sample(1:nrow(e),0.5*nrow(e)),4]=NA
# 50 percent of column 4 is now missing
# Try to recover it with AutoComplete
ce1=AutoComplete(df=ce,goal=4,maxi=4,repetitions=nrow(ce),trigger=1)
# NA_VALUES shows how many values are still missing
print(NA_VALUES(ce1))
# Count how many values were recovered incorrectly
print(sum(ce1[,4]!=e[,4]))
# Here e holds the original data
# and ce1 holds the recovered data
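
The error count in the example relies on every missing value having been filled: in base R, sum(x != y) returns NA as soon as one compared value is NA. A minimal sketch of a comparison that reports unfilled and wrong values separately is shown below. The helper recovery_summary and the small vectors are hypothetical and are not part of cleanerR. Converting to character is an assumption made so that factor columns with different level sets can still be compared.

```r
# Hypothetical helper, NOT part of cleanerR: compares a recovered column
# against the original and tolerates values that are still NA.
recovery_summary <- function(original, recovered) {
  original  <- as.character(original)
  recovered <- as.character(recovered)
  still_missing <- is.na(recovered)
  # A value counts as wrong only if it was filled and differs from the truth.
  wrong <- !still_missing & recovered != original
  c(missing = sum(still_missing), wrong = sum(wrong, na.rm = TRUE))
}

truth  <- c("A B", "A C", "B B", "C A")
filled <- c("A B", NA,    "B C", "C A")

res <- recovery_summary(truth, filled)
print(res)
```

With the data from the example above, the call would be recovery_summary(e[, 5], ce1[, 5]).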

Example output

[1] 1
[1] 2
[1] 3
Warning message:
In names(df)[solution] == names(df_aux) :
  longer object length is not a multiple of shorter object length
                                          e 
                                          0 
                                         e1 
                                          0 
                                         e2 
                                          0 
             paste.LETTERS.e...LETTERS.e1.. 
                                          0 
paste.LETTERS.e...LETTERS.e1...LETTERS.e2.. 
                                          0 
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
                                          e 
                                          0 
                                         e1 
                                          0 
                                         e2 
                                          0 
             paste.LETTERS.e...LETTERS.e1.. 
                                          0 
paste.LETTERS.e...LETTERS.e1...LETTERS.e2.. 
                                          0 
[1] 0

cleanerR documentation built on May 2, 2019, 5:51 a.m.