Una comparación de métodos de imputación de variables categóricas con patrón univariado

Juan Armando Torres Munguía

doi:10.46661/revmetodoscuanteconempresa.2196

Authors

Juan Armando Torres Munguía Maestría en Estadística Aplicada Instituto Tecnológico y de Estudios Superiores de Monterrey (México)

Keywords:

Imputation methods, hot-deck, polytomous regression, random forests, smoking habits, missing categorical data

Abstract

This paper examines sample proportion estimates in the presence of univariate missing categorical data. A database on smoking habits (the 2011 National Addiction Survey of Mexico) was used to create simulated yet realistic datasets with missingness rates of 5% and 15%, each under MCAR, MAR, and MNAR mechanisms. The performance of six methods for addressing missingness was then evaluated: listwise deletion, mode imputation, random imputation, hot-deck, imputation by polytomous regression, and random forests. The results show that hot-deck and polytomous regression were the most effective approaches for handling missing categorical data in most of the scenarios assessed.
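
To make the simulation design concrete, the following is a minimal sketch, not the paper's actual code, of how MCAR missingness could be induced in a single categorical variable and how three of the simpler methods (listwise deletion, mode imputation, random imputation) change the estimated proportions. The category labels, their shares, and all function names are hypothetical and chosen only for illustration. The MAR and MNAR mechanisms, hot-deck, polytomous regression, and random forests are not covered here.

```python
import random
from collections import Counter


def proportions(values):
    """Sample proportion of each category, ignoring missing (None) entries."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {cat: n / len(observed) for cat, n in counts.items()}


def make_mcar(values, rate, rng):
    """Set a fixed fraction of entries to None, chosen uniformly at random (MCAR)."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_mode(values):
    """Replace every missing entry with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def impute_random(values, rng):
    """Replace every missing entry with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


rng = random.Random(0)

# Hypothetical smoking-status variable; the category shares are illustrative,
# not taken from the survey.
complete = ["never"] * 600 + ["former"] * 250 + ["current"] * 150
rng.shuffle(complete)

# 15% missingness, one of the two rates mentioned in the abstract.
incomplete = make_mcar(complete, 0.15, rng)

estimates = {
    "listwise": proportions(incomplete),  # complete cases only
    "mode": proportions(impute_mode(incomplete)),
    "random": proportions(impute_random(incomplete, rng)),
}

for method, props in estimates.items():
    print(method, {cat: round(p, 3) for cat, p in sorted(props.items())})
```

In this sketch, mode imputation always inflates the share of the most frequent category relative to the complete-case estimate, whereas random imputation tends to preserve the observed distribution. That difference is why mode imputation is a known source of bias in proportion estimates.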

Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern