In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?


Posted on 16 Aug 2022.


Answers (2)

manpreet (Best Answer) 2 years ago

Am I looking for a better behaved distribution for the independent variable in question, or to reduce the effect of outliers, or something else?

manpreet 2 years ago


I always hesitate to jump into a thread with as many excellent responses as this, but it strikes me that few of the answers provide any reason to prefer the logarithm to some other transformation that "squashes" the data, such as a root or reciprocal.

Before getting to that, let's recapitulate the wisdom in the existing answers in a more general way. Some non-linear re-expression of the dependent variable is indicated when any of the following apply:

  • The residuals have a skewed distribution. The purpose of a transformation is to obtain residuals that are approximately symmetrically distributed (about zero, of course).

  • The spread of the residuals changes systematically with the values of the dependent variable ("heteroscedasticity"). The purpose of the transformation is to remove that systematic change in spread, achieving approximate "homoscedasticity."

  • To linearize a relationship.

  • When scientific theory indicates. For example, chemistry often suggests expressing concentrations as logarithms (giving activities or even the well-known pH).

  • When a more nebulous statistical theory suggests the residuals reflect "random errors" that do not accumulate additively.

  • To simplify a model. For example, sometimes a logarithm can simplify the number and complexity of "interaction" terms.

(These indications can conflict with one another; in such cases, judgment is needed.)
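The first two indications (skewed residuals and spread that changes with level) can be checked numerically. The following is a minimal sketch on simulated data, not a procedure taken from the answer: the data-generating model (multiplicative noise around an exponential trend), the helper names `fit_line`, `skewness`, and `diagnostics`, and the "upper half versus lower half" spread comparison are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


def diagnostics(x, y):
    """Return (residual skewness, SD of residuals in the upper half of
    fitted values divided by SD in the lower half)."""
    a, b = fit_line(x, y)
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    resid = [r for _, r in pairs]
    half = len(resid) // 2
    ratio = statistics.pstdev(resid[half:]) / statistics.pstdev(resid[:half])
    return skewness(resid), ratio


# Simulated data (an assumption): errors multiply the response.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [math.exp(1 + 0.3 * xi + rng.gauss(0, 0.5)) for xi in x]

raw_skew, raw_ratio = diagnostics(x, y)
log_skew, log_ratio = diagnostics(x, [math.log(yi) for yi in y])

print(f"raw y: skewness={raw_skew:.2f}, spread ratio={raw_ratio:.2f}")
print(f"log y: skewness={log_skew:.2f}, spread ratio={log_ratio:.2f}")
```

On data like these, the raw fit should show positively skewed residuals whose spread grows with the fitted values, while the fit to log y should show roughly symmetric residuals with a spread ratio near 1. Real data will rarely be this clean, so a residual plot is still worth looking at.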

So, when is a logarithm specifically indicated instead of some other transformation?

  • The residuals have a "strongly" positively skewed distribution. In his book on EDA, John Tukey provides quantitative ways to estimate the transformation (within the family of Box-Cox, or power, transformations) based on rank statistics of the residuals. It really comes down to the fact that if taking the log symmetrizes the residuals, it was probably the right form of re-expression; otherwise, some other re-expression is needed.

  • When the SD of the residuals is directly proportional to the fitted values (and not to some power of the fitted values).

  • When the relationship is close to exponential.

  • When residuals are believed to reflect multiplicatively accumulating errors.

  • When you really want a model in which marginal changes in the explanatory variables are interpreted in terms of multiplicative (percentage) changes in the dependent variable.
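The criterion that the SD be directly proportional to the level (and not to some other power of it) can be examined with a spread-versus-level check. The sketch below is a common heuristic offered as an illustration, not necessarily the rank-based method Tukey describes: regress log(SD) on log(mean) across groups; if SD is proportional to mean^b, then the power 1 - b is the suggested re-expression, and a power of 0 corresponds to the logarithm. The simulated data, the replicated design (eight levels of x with 200 observations each), and the helper name `slope` are assumptions.

```python
import math
import random
import statistics


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Simulated data (an assumption): multiplicative errors, so the SD of y
# at each level of x is proportional to the mean of y at that level.
rng = random.Random(1)
log_mean, log_sd = [], []
for level in range(1, 9):
    ys = [math.exp(1 + 0.4 * level + rng.gauss(0, 0.4)) for _ in range(200)]
    log_mean.append(math.log(statistics.fmean(ys)))
    log_sd.append(math.log(statistics.stdev(ys)))

b = slope(log_mean, log_sd)   # exponent in SD ~ mean**b
suggested_power = 1 - b       # near 0 points to a log; near 0.5 to a root

print(f"spread-vs-level slope: {b:.2f}")
print(f"suggested power: {suggested_power:.2f}")
```

A slope near 1 (suggested power near 0) is consistent with using the log. A slope near 0.5 would instead point toward a square root, and a slope near 0 would suggest that no re-expression is needed to stabilize the spread.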

Finally, some non-reasons to use a re-expression:

  • Making outliers not look like outliers. An outlier is a datum that does not fit some parsimonious, relatively simple description of the data. Changing one's description in order to make outliers look better is usually an incorrect reversal of priorities: first obtain a scientifically valid, statistically good description of the data and then explore any outliers. Don't let the occasional outlier determine how to describe the rest of the data!

  • Because the software automatically did it. (Enough said!)

  • Because all the data are positive. (Positivity often implies positive skewness, but it does not have to. Furthermore, other transformations can work better. For example, a root often works best with counted data.)

  • To make "bad" data (perhaps of low quality) appear well behaved.

  • To be able to plot the data. (If a transformation is needed to be able to plot the data, it's probably needed for one or more good reasons already mentioned. If the only reason for the transformation truly is for plotting, go ahead and do it--but only to plot the data. Leave the data untransformed for analysis.)
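The remark in the list above that a root often works best with counted data can be illustrated with a small simulation. This is a sketch under a loud assumption: the counts are taken to be Poisson distributed, which the answer does not state. The sampler `poisson` (Knuth's multiplication method) is a helper written for this example, and `log(count + 1)` is used only because a plain log is undefined at zero.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


rng = random.Random(2)
results = {}
for lam in (4, 16, 64):
    counts = [poisson(rng, lam) for _ in range(2000)]
    results[lam] = (
        statistics.stdev(counts),                              # raw counts
        statistics.stdev([math.sqrt(c) for c in counts]),      # square root
        statistics.stdev([math.log(c + 1) for c in counts]),   # log(1 + count)
    )

for lam, (sd_raw, sd_root, sd_log) in results.items():
    print(f"mean={lam:>2}: SD raw={sd_raw:.2f}, sqrt={sd_root:.2f}, log1p={sd_log:.2f}")
```

Under the Poisson assumption, the SD of the raw counts grows with the mean, the SD of the square roots stays close to 0.5 at every level, and the SD of the logged counts shrinks as the mean grows, meaning the log over-corrects. Counts that are overdispersed relative to Poisson may behave differently.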

