Why is multicollinearity not checked in modern statistics/machine learning?

manpreet · 2 years ago

In traditional statistics, while building a model, we check for multicollinearity using methods such as estimates of the variance inflation factor (VIF), but in machine learning, we instead use regularization for feature selection and don't seem to check whether features are correlated at all. Why do we do that?
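For reference, the traditional VIF check mentioned here might look like the following minimal sketch, assuming statsmodels and a made-up set of predictors (the variable names are purely illustrative):

```python
# Minimal sketch of a traditional VIF check on made-up data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

Xc = sm.add_constant(X)                       # include an intercept column
vifs = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vifs)   # x1 and x2 show very large VIFs; x3 stays near 1
```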

manpreet · 2 years ago

Considering multicollinearity is important in regression analysis because, in the extreme, it directly bears on whether or not your coefficients are uniquely identified in the data. In less severe cases, it can still mess with your coefficient estimates: small changes in the data used for estimation may cause wild swings in the estimated coefficients. That can be problematic from an inferential standpoint: if two variables are highly correlated, increases in one may be offset by decreases in the other, so their combined effect is to negate each other. With more than two variables, the effect can be even more subtle, but if the predictions are stable, that is often enough for machine learning applications.
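A minimal sketch of that instability, assuming scikit-learn and synthetic data in which two predictors are nearly identical: the coefficients swing wildly between resamples while the predictions barely move.

```python
# Sketch: coefficient instability under near-collinearity (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)    # only x1 truly matters

X = np.column_stack([x1, x2])
for seed in (2, 3):                           # two small resamples of the rows
    idx = np.random.default_rng(seed).choice(n, size=n, replace=True)
    fit = LinearRegression().fit(X[idx], y[idx])
    print(fit.coef_, fit.predict(X[:3]))      # coefficients swing, predictions don't
```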

Consider why we regularize in a regression context: we need to keep the model from being too flexible. Applying the correct amount of regularization will slightly increase the bias in exchange for a larger reduction in variance. The classic example of this is adding polynomial terms and interaction effects to a regression: in the degenerate case, the prediction equation will interpolate the data points, but it will probably be terrible when attempting to predict the values of unseen data points. Shrinking those coefficients toward zero, or eliminating some of them entirely, will likely improve generalization.
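A minimal sketch of that trade-off, assuming scikit-learn and made-up data: the same high-degree polynomial features fit by ordinary least squares and with a ridge penalty shrinking the coefficients.

```python
# Sketch: polynomial features overfit under OLS; a ridge penalty shrinks the
# coefficients and usually generalises better (made-up data).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(25, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=25)

ols = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                    StandardScaler(), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                      StandardScaler(), Ridge(alpha=1.0))

for name, model in [("ols", ols), ("ridge", ridge)]:
    r2 = cross_val_score(model, X, y, cv=5).mean()   # out-of-sample R^2
    print(name, round(r2, 3))   # ridge typically scores better here
```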

A random forest, however, could be seen to have a regularization parameter through the number of variables sampled at each split: you get better splits the larger the mtry (more features to choose from; some of them are better than others), but that also makes each tree more highly correlated with each other tree, somewhat mitigating the diversifying effect of estimating multiple trees in the first place. This dilemma compels one to find the right balance, usually achieved using cross-validation. Importantly, and in contrast to a regression analysis, no part of the random forest model is harmed by highly collinear variables: even if two of the variables provide the same child node purity, you can just pick one without diminishing the quality of the result.
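In scikit-learn the analogue of mtry is max_features, and the balance described above is usually tuned by cross-validation; a minimal sketch on synthetic data:

```python
# Sketch: tuning the mtry analogue (max_features) of a random forest by
# cross-validation (synthetic regression data for illustration).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"max_features": [0.2, 0.5, 0.8, 1.0]},  # fraction of features per split
    cv=5,
)
search.fit(X, y)
print(search.best_params_)   # the cross-validated sweet spot for max_features
```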

Likewise, for something like an SVM, you can include more predictors than observations because the kernel trick lets you operate solely on the inner products of those feature vectors. Having more features than observations would be a problem in a regression, but the kernel trick means we only estimate a coefficient for each exemplar, while the regularization parameter C reduces the flexibility of the solution, which is decidedly a good thing, since estimating N parameters for N observations in an unrestricted way will always produce a perfect fit on the training data. So we come full circle, back to the ridge/LASSO/elastic net regression scenario, where model flexibility is constrained as a check against an overly optimistic model. A review of the KKT conditions of the SVM problem reveals that the SVM solution is unique, so we don't have to worry about the identification problems that arose in the regression case.
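A minimal sketch of the more-predictors-than-observations point, assuming scikit-learn's SVR on made-up data: the kernel only ever sees the N-by-N matrix of inner products, and one coefficient is estimated per support vector rather than per feature.

```python
# Sketch: an SVM fits comfortably with more predictors than observations,
# because the kernel works on the N x N Gram matrix (synthetic data).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n, p = 50, 500                         # far more features than observations
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

model = SVR(kernel="rbf", C=1.0)       # C is the regularization parameter from the text
model.fit(X, y)
print(model.dual_coef_.shape)          # one coefficient per support vector, not per feature
```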

Finally, consider the actual impact of multicollinearity. It doesn't change the predictive power of the model (at least, on the training data) but it does screw with our coefficient estimates. In most ML applications, we don't care about coefficients themselves, just the loss of our model predictions, so in that sense, checking VIF doesn't actually answer a consequential question. (But if a slight change in the data causes a huge fluctuation in coefficients [a classic symptom of multicollinearity], it may also change predictions, in which case we do care -- but all of this [we hope!] is characterized when we perform cross-validation, which is a part of the modeling process anyway.) A regression is more easily interpreted, but interpretation might not be the most important goal for some tasks.
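A minimal sketch of that last point, again assuming scikit-learn and synthetic data: adding a nearly collinear copy of a predictor barely changes the cross-validated prediction error, even though a VIF check would flag it.

```python
# Sketch: a nearly collinear duplicate feature barely changes the
# cross-validated prediction error (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 * x1 - x2 + rng.normal(scale=0.5, size=n)

X_clean = np.column_stack([x1, x2])
X_collinear = np.column_stack([x1, x2, x1 + rng.normal(scale=0.01, size=n)])

for name, X in [("clean", X_clean), ("collinear", X_collinear)]:
    mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 3))          # nearly identical out-of-sample error
```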

