Is regularized regression a happy medium or does it simply combine the disadvantages of traditional regression and machine learning approaches? The answer is that there is a bit of truth to both outlooks.
(Note: Moving forward I use “regression” to mean more traditional linear and semi-linear prediction models for either continuous or classification tasks, as opposed to the narrower definition of regression that covers only continuous outcomes and excludes classification.)
Modern machine learning has advantages and disadvantages of its own. I will focus on the more extreme example of neural networks. These networks allow you to input correctly formatted data and, given enough complexity in the network, automatically transform that data and learn complex relationships and interactions between your variables and whatever you are trying to predict.
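To make that concrete, here is a minimal sketch (using scikit-learn’s MLPRegressor on made-up data, so the settings and variables are purely illustrative) of a small network picking up a squared term and an interaction it was never told about:

```python
# Minimal sketch: a small neural network learning a nonlinear relationship
# without it being specified by hand. Data and settings are illustrative only.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# The true outcome depends on a squared term and an interaction
# that are never spelled out for the model.
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("held-out R^2:", round(net.score(X_test, y_test), 2))
```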
Unfortunately, neural networks also ask a lot of your data. Essentially everything I just described is determined from the provided data alone, along with best practices for employing neural networks. This isn’t a big issue when you provide the algorithm a large, complete, and accurate dataset, and indeed many datasets are becoming larger and more complete. But at the same time, smaller and more preliminary datasets appear every day. Estimating many coefficients and data relationships from limited data can produce results that are less replicable than they are worth.
Further, neural networks create a vast array of relationships, interactions, and coefficients that are practically impossible to interpret directly. This is what people mean when they call them a ‘black box.’ You can partly work around this disadvantage by studying correlations between your input variables and the model’s outputs, sidestepping the details of the model itself. While this is definitely valuable, you may frequently find that many variables have implausible correlations with your output, and that this can only be changed with significant manual overrides. Also, the nonlinear nature of the transformations can still lead to unexpected results for ranges and combinations of input values, whether or not they appear in your current dataset.
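Here is a minimal sketch of that kind of model-agnostic interpretation, again on made-up data: correlate each input with the fitted model’s predictions rather than reading the network’s weights. The particular model and dataset are just assumptions for illustration.

```python
# Minimal sketch of interpreting a black-box model by correlating each input
# with the model's predictions instead of inspecting its weights directly.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(X, y)
preds = net.predict(X)

for j in range(X.shape[1]):
    r = np.corrcoef(X[:, j], preds)[0, 1]
    print(f"feature {j}: correlation with predictions = {r:+.2f}")
# Caveat: a near-zero correlation here does not mean the feature is unused;
# nonlinear and interaction effects can hide behind a flat marginal correlation.
```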
Traditional regression models, on the other hand, are generally created manually. Exploratory analysis of the data and a survey of existing knowledge inform this approach. Variables are added based on subject matter expertise, theoretical knowledge, and correlations in the data. Nonlinear effects and interactions are added intentionally and sparingly. Various statistical tests are performed to assess the validity of the resulting models, along with a comparison of estimated results to previous subject matter knowledge and theory. Finally, prediction or forecasting performance may be evaluated by comparing results against existing data and judging the plausibility of predictions for outcomes not yet known.
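As a rough sketch of that workflow (assuming statsmodels and an invented dataset, so the variable names and the interaction term are illustrative), the analyst specifies a small set of terms deliberately and then reads the standard battery of statistical output:

```python
# Minimal sketch of a manually specified regression model with standard
# statistical output. Variables and the added interaction are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.normal(50, 10, 500),
    "dose": rng.uniform(0, 5, 500),
})
df["outcome"] = 2.0 + 0.05 * df["age"] + 1.5 * df["dose"] + rng.normal(scale=1.0, size=500)

# Terms are chosen by the analyst; the interaction is added deliberately, not discovered.
model = smf.ols("outcome ~ age + dose + age:dose", data=df).fit()
print(model.summary())  # coefficients, t-tests, F-test, R^2, and so on
```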
The advantage of traditional regression models is that you are not relying on the available data alone to estimate the model. Instead, you are bringing in an array of outside information such as subject matter knowledge, assumed statistical distributions, or creative exploratory analysis of existing data, informed by a human mind trained on similar tasks. Further, these models often include complexities like nonlinear effects and interactions only as required. This approach asks much less of your dataset when determining complex relationships. While the result may be less realistic, in traditional modeling realism is frequently traded for practicality. In many applications, simpler linear models based on a handful of strong predictors can achieve most of the predictive performance of more complex models.
Further, traditional models are not only easier to interpret but frequently produce interpretations that are more palatable. Estimated coefficients are relatively few in number, and it is easy to determine the estimated impact of each predictor on the outcome. Since the model is being created manually anyway, manually overriding difficult-to-explain relationships is more straightforward. However, these manual overrides may close the analyst off from unexpected, and perhaps true, relationships in the data. In other words, assumptions are only useful when they are correct. Further, this manual process is labor intensive to begin with and just as intensive to maintain. Every time new data is introduced, an analyst must ensure the litany of tests and checks is replicated and, if they no longer pass, the entire development process may begin again. This is why many traditional models are out of date despite updated data being a key aid to predictive performance.
So is there a compromise to be had between these two approaches? To some extent there is. Approaches such as regularized regression (lasso, ridge, and elastic net), regularized logistic regression, and regularized proportional hazards models allow for automatic variable selection and estimation based on predictive performance while omitting the complex nonlinear effects of neural networks. The result is similar to classic variable selection approaches such as forward, backward, and stepwise selection. However, regularized regression avoids some of the disadvantages of those approaches, such as sensitivity to the order in which variables are added to the model. Further, regularized regression allows uncertain coefficients to be shrunk rather than eliminating those variables from the model altogether.
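To make this concrete, here is a minimal sketch using scikit-learn; the penalty strengths and the generated datasets are illustrative assumptions rather than recommendations:

```python
# Minimal sketch of regularized linear and logistic regression in scikit-learn.
# Penalty strengths (alpha, C) and the datasets are purely illustrative.
import numpy as np
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LogisticRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)                     # L1: drives some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks coefficients toward zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # a mix of both penalties

print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0), "of", X.shape[1])

# Regularized logistic regression for a classification task.
Xc, yc = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Xc, yc)
print("nonzero logistic coefficients:", np.sum(logit.coef_ != 0))
```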
I personally believe that regularized regression is a great approach in situations where your dataset isn’t large or complete enough to make a more complicated approach wise or worthwhile. Since the model is grounded mostly in predictive performance, the analyst forgoes performing a long list of statistical tests. The model can be estimated and re-estimated via algorithm. Nonlinear effects can also be added to the model by simply transforming the inputs manually. Further, the model produces coefficients that are interpretable in the same way as a traditional regression.
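A quick sketch of that workflow, again assuming scikit-learn: the penalty is tuned automatically by cross-validation rather than through a battery of tests, and a nonlinear effect enters only because the analyst adds a transformed input by hand. The specific transformation here is just an example.

```python
# Minimal sketch: penalty chosen automatically by cross-validation, with a
# nonlinear effect added only via a manual transformation of an input.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# The analyst adds the squared term explicitly; the algorithm handles the rest.
X_aug = np.column_stack([X, X[:, 1] ** 2])

model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, random_state=0).fit(X_aug, y)
print("chosen alpha:", model.alpha_, "chosen l1_ratio:", model.l1_ratio_)
print("coefficients:", np.round(model.coef_, 2))  # interpretable like ordinary regression
```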
Nonetheless, regularized regression does combine some drawbacks of both approaches. While you can easily interpret these models, you might not want to. It’s relatively common for these models to produce nonsensical relationships, such as a disease improving health outcomes. If you want to eliminate these relationships manually or add constraints to your estimation, the benefit of automation is reduced. Further, in cases where you have large and complete datasets, the basic linear assumptions of any regression model are likely to leave a lot of information on the table in terms of interactions and nonlinear effects.
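As one example of such a constraint (a sketch only, using scikit-learn’s positive option, which is coarser than a targeted override because it applies to every coefficient at once):

```python
# Minimal sketch of constraining a regularized fit with outside knowledge:
# here, forcing all coefficients to be non-negative. Data are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=1.0, size=200)

unconstrained = Lasso(alpha=0.1).fit(X, y)
constrained = Lasso(alpha=0.1, positive=True).fit(X, y)  # sign constraint on every coefficient

print("unconstrained:", np.round(unconstrained.coef_, 2))
print("non-negative: ", np.round(constrained.coef_, 2))
```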
All in all, there are many applications for regularized regression, especially for analysts who want to achieve good predictive performance on limited datasets while reducing potentially unnecessary work on their part. However, in cases where interpretable models are paramount, or when predictive performance on complex problems needs to be optimized, there are probably better options.
Below is a short project I did with scikit-survival implementing regularized survival regressions. Also, my Northern Michigan search interest forecasts on this site are partially created using regularized regression.