Maximum-Likelihood Inference after Model Selection
Standard statistical techniques often fail in the presence of data-driven model selection, yielding inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and inference difficult. Recently, novel methodologies have been proposed for performing valid inference in selected models. However, these often rely on conditioning on extra information beyond the selection of a model, a practice that is known to result in a loss of efficiency. Furthermore, very little attention has been given to the estimation of selected parameters. In our work, we propose to take what is arguably the most ubiquitous approach to data analysis: computing maximum likelihood estimates and constructing confidence intervals based on their asymptotic properties. We get around the intractable likelihood by efficiently generating noisy unbiased estimates of the post-selection score function and using them in a stochastic ascent algorithm. We apply the proposed technique to the problem of estimating linear models selected by the lasso. In an asymptotic analysis, the resulting estimates are shown to be consistent for the selected parameters and to have a limiting truncated normal distribution. In a simulation study, confidence intervals constructed from the asymptotic distribution attain close-to-nominal coverage rates, and the point estimates are shown to be superior to the lasso estimates when the true model is sparse. To conclude, we briefly discuss two applications. The first is in GWAS data, where we develop a methodology for performing inference in regression models that pass an aggregate test, and the second is in fMRI data, where we propose techniques for performing cluster-wise analysis.
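To make the idea of stochastic ascent on a noisy post-selection score concrete, the following is a minimal sketch in a deliberately simple univariate setting; it is not the lasso procedure described above. It assumes a single observation y ~ N(mu, 1) that is reported only when |y| > c. In that toy model the conditional score is y minus the conditional mean of Y given selection, so y minus one draw from the selected distribution is an unbiased estimate of the score. The function names, the rejection sampler, and the 1/t step size are illustrative assumptions.

```python
import random


def sample_selected(mu, c, rng):
    """Draw Y ~ N(mu, 1) conditional on |Y| > c by rejection sampling.

    Assumption: a plain rejection sampler stands in for whatever
    efficient sampler a real application would use.
    """
    while True:
        y = rng.gauss(mu, 1.0)
        if abs(y) > c:
            return y


def conditional_mle(y_obs, c, n_iter=20000, seed=0):
    """Stochastic ascent on the post-selection log-likelihood (toy model).

    The exact score at mu is y_obs - E_mu[Y | |Y| > c]; replacing the
    expectation with a single conditional draw gives an unbiased but
    noisy score estimate.
    """
    rng = random.Random(seed)
    mu = y_obs  # start from the naive, unconditional estimate
    for t in range(1, n_iter + 1):
        y_star = sample_selected(mu, c, rng)
        score_hat = y_obs - y_star   # noisy unbiased score estimate
        mu += score_hat / t          # Robbins-Monro step size 1/t
    return mu


if __name__ == "__main__":
    # Hypothetical inputs: an observation just past a 1.96 threshold.
    print(conditional_mle(y_obs=2.1, c=1.96))
```

In this toy example the conditional estimate is pulled toward zero relative to the naive value of 2.1, reflecting the fact that an observation only just past the threshold carries little evidence of a large effect once selection is accounted for.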