**dprep**: Data Pre-Processing and Visualization Functions for Classification

# Dataset's cleaning

### Description

A function to eliminate rows and columns that have a percentage of missing values greater than the allowed tolerance.

### Usage

`clean(w, tol.col = 0.5, tol.row = 0.3, name)`

### Arguments

| Argument | Description |
| --- | --- |
| `w` | The dataset to be examined and cleaned. |
| `tol.col` | Maximum ratio of missing values allowed in columns. The default value is 0.5. Columns with a larger ratio of missing values will be eliminated unless they are known to be relevant attributes. |
| `tol.row` | Maximum ratio of missing values allowed in rows. The default value is 0.3. Rows with a ratio of missing values larger than the established tolerance will be eliminated. |
| `name` | Name of the dataset to be used for the optional report. |
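Before choosing `tol.col` and `tol.row`, it can help to look at how much is actually missing. The following is a minimal base R sketch, not code from dprep; the toy data frame and the names `col_missing` and `row_missing` are invented for illustration.

```r
# Toy data frame; the values are invented for illustration only.
w <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, 4)
)

# Fraction of missing values in each column and in each row.
col_missing <- colMeans(is.na(w))
row_missing <- rowMeans(is.na(w))

# Columns and rows that exceed the documented default tolerances
# (0.5 for columns, 0.3 for rows).
names(w)[col_missing > 0.5]
which(row_missing > 0.3)
```

Comparing these ratios with the tolerances shows in advance which columns and rows are candidates for elimination.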

### Details

This function can create an optional report on the cleaning process if the comment symbols are removed from the last lines of its code. The report is returned to the workspace, where it can be reexamined as needed. The name of the report object begins with `Clean.rep`.

### Value

| Value | Description |
| --- | --- |
| `w` | The original dataset, with missing values that were in relevant variables imputed. |

### Author(s)

Caroline Rodriguez

### References

Acuna, E. and Rodriguez, C. (2004). The treatment of missing values and its effect in the classifier accuracy. In D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering and Data Mining Applications. Springer-Verlag, Berlin-Heidelberg, 639-648.

### See Also

`ce.impute`

### Examples

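The behavior described above can be approximated in base R. This is only a sketch under stated assumptions and not the dprep implementation: `clean_sketch` and the `toy` data are hypothetical, columns are assumed to be filtered before rows, and the exemption for relevant attributes and the optional report are not modeled.

```r
# Hypothetical helper, not part of dprep: drops columns, then rows, whose
# fraction of missing values exceeds the given tolerance.
clean_sketch <- function(w, tol.col = 0.5, tol.row = 0.3) {
  # Assumption: columns are filtered first, and row ratios are then
  # computed on the remaining columns only.
  keep_cols <- colMeans(is.na(w)) <= tol.col
  w <- w[, keep_cols, drop = FALSE]
  keep_rows <- rowMeans(is.na(w)) <= tol.row
  w[keep_rows, , drop = FALSE]
}

# Toy data; the values are invented for illustration only.
toy <- data.frame(
  x1 = c(1, 2, NA, 4, 5),
  x2 = c(NA, NA, NA, 4, NA),  # 80% missing, above tol.col
  x3 = c(1, NA, NA, 4, 5),
  x4 = c(1, 2, 3, 4, 5)
)

cleaned <- clean_sketch(toy)
dim(cleaned)
names(cleaned)
```

With the default tolerances, the sketch drops column `x2` and then the two rows that are missing at least one of the three remaining values, since one missing value out of three already exceeds 0.3.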
