Subsampling for Feature Selection in Large Regression Data
Speaker: Prof. Jiayang Sun (Case Western Reserve University)
Time: 2016-12-17, 15:00-15:30
Venue: Room 1114, Science Building No. 1
Feature selection from a large number of features in regression analysis remains a challenge in data science. One popular approach to feature selection in large regression data with sparse features is penalized likelihood or shrinkage estimation, such as the LASSO, SCAD, elastic net, and MCP penalties. We present a different approach using a new subsampling method, the Subsampling Winner algorithm (SWA), for feature selection in large regression data. The central idea of our approach is analogous to the election of National Merit Scholars. SWA applies a 'base procedure' to each of many subsamples, scores all features according to their performance across the subsample analyses, obtains the 'semifinalists' by ranking the resulting scores, and finally determines the 'finalists', that is, the important features, from among the semifinalists. Due to its subsampling nature, SWA applies in principle to data of any dimension, including data too large to run a statistical procedure on in full with an existing software package. We provide paneling plots for choosing the subsample size, compare our method with the elastic net (a generalization of the LASSO), SCAD, MCP, and random forests, and illustrate an application of SWA to genomic data on ovarian cancer. (Joint work with Y. Richard Fan.)
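The subsample-score-rank-refit scheme described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: it assumes ordinary least squares as the base procedure, random feature subsets as the subsamples, average absolute coefficients as the scores, and a final refit on the semifinalists to pick the finalists. All function and parameter names here are hypothetical.

```python
import numpy as np

def subsampling_winner(X, y, subsample_size=5, n_subsamples=200,
                       n_semifinalists=10, n_finalists=3, rng=None):
    """Illustrative sketch of a Subsampling Winner-style selector.

    Assumptions (not from the paper): the base procedure is OLS on a
    random subset of features, and a feature's score is its average
    absolute OLS coefficient over the subsamples that included it.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    scores = np.zeros(p)   # accumulated |coefficient| per feature
    counts = np.zeros(p)   # how often each feature was subsampled
    for _ in range(n_subsamples):
        # Draw a random subsample of features and fit the base procedure.
        idx = rng.choice(p, size=subsample_size, replace=False)
        beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        scores[idx] += np.abs(beta)
        counts[idx] += 1
    # Rank features by average score to get the 'semifinalists'.
    avg = np.where(counts > 0, scores / np.maximum(counts, 1), 0.0)
    semifinalists = np.argsort(avg)[::-1][:n_semifinalists]
    # Refit on the semifinalists and keep the top 'finalists'.
    beta_f, *_ = np.linalg.lstsq(X[:, semifinalists], y, rcond=None)
    finalists = semifinalists[np.argsort(np.abs(beta_f))[::-1][:n_finalists]]
    return np.sort(finalists)
```

In this sketch the base procedure, the scoring rule, and the two cutoffs (semifinalists and finalists) are the tuning knobs that the paneling plots mentioned above would help choose.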