Computational and statistical challenges with high dimensionality: A new method and efficient algorithm for feature selection in knowledge discovery
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2012 Issue 1, October 2012, Volume 2012, CSO2
Abstract
Qatar is currently building one of the largest research infrastructures in the Middle East. To this end, Qatar Foundation has established a number of universities and institutes staffed by highly qualified researchers. In particular, QCRI is forming a multidisciplinary scientific computing group with a special interest in machine learning, data mining and bioinformatics. We are now able to address the computational and statistical needs of a variety of researchers with a vital set of services contributing to the development of Qatar. The availability of massive amounts of data and challenges from the frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. There is little doubt that high-dimensional data analysis will be the most important research topic in statistics in the 21st century. Indeed, the challenges of high dimensionality arise in diverse fields of science, engineering and the humanities, ranging from genomics and health sciences to economics, finance, machine learning and data mining. For example, in biomedical studies, large numbers of magnetic resonance imaging (MRI) and functional MRI scans are collected for each subject, with hundreds of subjects involved. Satellite imagery has been used in natural resource discovery and agriculture, producing thousands of high-resolution images. Similar examples are plentiful in computational biology, climatology, geology and neurology, among others. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. In this paper, we propose a computationally intensive method for regularization and variable selection in linear models. The method is based on penalized least squares with a penalty function that combines the minimax concave penalty (MCP) with an L2 penalty on successive differences between coefficients. We call it the SF-MCP method.
Extensive simulation studies and applications to large biomedical data sets (leukemia and glioblastoma cancers, diabetes, proteomics and metabolomics data sets) show that our approach outperforms its competitors in terms of prediction error and identification of the relevant genes responsible for some lethal diseases.
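
To make the penalty described in the abstract concrete, the following is a minimal sketch of a penalized least-squares objective that combines an MCP term with a squared-difference term on neighbouring coefficients. The abstract does not give the exact formulation, so several details here are assumptions: the 1/(2n) scaling of the loss, the factor 1/2 on the smoothness term, and the names `mcp_penalty`, `smoothness_penalty`, `sf_mcp_objective`, `lam1`, `lam2` and `gamma` are all illustrative. The MCP expression itself is the standard closed form. No fitting algorithm is shown; the sketch only evaluates the objective for a given coefficient vector.

```python
from typing import Sequence


def mcp_penalty(beta: Sequence[float], lam: float, gamma: float) -> float:
    """Standard MCP summed over all coefficients (gamma > 0 assumed)."""
    total = 0.0
    for b in beta:
        a = abs(b)
        if a <= gamma * lam:
            # Concave quadratic region: the penalty rate tapers off as |b| grows.
            total += lam * a - a * a / (2.0 * gamma)
        else:
            # Beyond gamma * lam the penalty is constant.
            total += 0.5 * gamma * lam * lam
    return total


def smoothness_penalty(beta: Sequence[float], lam2: float) -> float:
    """L2 penalty on successive differences (the 1/2 factor is an assumption)."""
    pairs = zip(beta[:-1], beta[1:])
    return 0.5 * lam2 * sum((b1 - b0) ** 2 for b0, b1 in pairs)


def sf_mcp_objective(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    beta: Sequence[float],
    lam1: float,
    gamma: float,
    lam2: float,
) -> float:
    """Least-squares loss plus MCP plus smoothness penalty (assumed scaling)."""
    n = len(y)
    rss = 0.0
    for row, yi in zip(X, y):
        fitted = sum(x * b for x, b in zip(row, beta))
        rss += (yi - fitted) ** 2
    loss = rss / (2.0 * n)
    return loss + mcp_penalty(beta, lam1, gamma) + smoothness_penalty(beta, lam2)
```

In this sketch, `lam1` and `gamma` control sparsity through the MCP term, while `lam2` pulls adjacent coefficients toward each other, which is useful only when the features have a meaningful ordering.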