
  • Is it possible to divide a variable by the mean across individuals for a regression?

    Hello,

    Is it okay to divide each individual's value of a variable by the sample mean and then use this transformed variable in a regression?

    For example:

    Code:
    sysuse auto, clear
    
    . reg price trunk weight displacement gear_ratio
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(4, 69)        =      8.54
           Model |   210211246         4  52552811.6   Prob > F        =    0.0000
        Residual |   424854150        69  6157306.52   R-squared       =    0.3310
    -------------+----------------------------------   Adj R-squared   =    0.2922
           Total |   635065396        73  8699525.97   Root MSE        =    2481.4
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           trunk |  -63.64507   91.74253    -0.69   0.490    -246.6664    119.3763
          weight |   2.160798   .8998892     2.40   0.019     .3655685    3.956028
    displacement |   10.36613   8.266774     1.25   0.214    -6.125634    26.85789
      gear_ratio |   2192.778   1140.727     1.92   0.059    -82.91105    4468.466
           _cons |  -8139.774   4688.715    -1.74   0.087     -17493.5    1213.956
    ------------------------------------------------------------------------------
    
    
    egen meanprice = mean(price)
    gen dividedprice = price/meanprice
    
    
    . reg dividedprice trunk weight displacement gear_ratio
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(4, 69)        =      8.54
           Model |  5.53036265         4  1.38259066   Prob > F        =    0.0000
        Residual |   11.177316        69  .161990087   R-squared       =    0.3310
    -------------+----------------------------------   Adj R-squared   =    0.2922
           Total |  16.7076786        73   .22887231   Root MSE        =    .40248
    
    ------------------------------------------------------------------------------
    dividedprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           trunk |  -.0103232   .0148806    -0.69   0.490    -.0400091    .0193627
          weight |   .0003505    .000146     2.40   0.019     .0000593    .0006417
    displacement |   .0016814   .0013409     1.25   0.214    -.0009936    .0043563
      gear_ratio |   .3556669   .1850251     1.92   0.059    -.0134481    .7247819
           _cons |  -1.320265    .760506    -1.74   0.087    -2.837433    .1969028
    ------------------------------------------------------------------------------
    Would this cause any trouble?
    The motivation would be to see which factors cause the price to lie above the sample average.
    Thank you!


  • #2
    Would this cause any trouble?
    No. You're just changing the scale of the variable, like converting a monetary measure from dollars to euros or yen, or from dollars to thousands of dollars.

    The motivation would be to see which factors cause the price to lie above the sample average.
    But it won't do that. As noted, it's just like changing currency units: nothing substantive changes. If you look at the outputs you generated you will notice that all of the coefficients and standard errors in the first regression table are exactly 6165.26 (not coincidentally, this is the mean value of price) times the corresponding figures in the second one (to within rounding error). And notice that all of the t-statistics and p-values agree exactly, so that any inferences made from either model would be exactly the same in the other.
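    The scaling argument is easy to verify numerically. Below is a small sketch in Python with NumPy (not Stata, and using simulated data rather than the auto dataset) confirming that dividing the outcome by a constant divides every coefficient and standard error by that constant, while leaving the t-statistics, and hence all inference, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 74, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + 3 predictors
y = X @ np.array([5.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients, standard errors, and t-statistics."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, beta / se

c = y.mean()                      # the scaling constant (sample mean of y)
b1, se1, t1 = ols(X, y)           # original outcome
b2, se2, t2 = ols(X, y / c)       # rescaled outcome

# Coefficients and standard errors shrink by exactly the factor c ...
assert np.allclose(b1, c * b2)
assert np.allclose(se1, c * se2)
# ... so the t-statistics (and hence p-values) are identical.
assert np.allclose(t1, t2)
```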

    If you just want to identify factors that are associated with higher prices, then whether you rescale or not, those are just the factors where the regression coefficients are positive. If you really want to identify factors that increase the probability of the price being above the mean price, then you need to create a dichotomous outcome variable that is 1 in observations where price is greater than the mean and 0 elsewhere. Then model that dichotomous variable, say, with a logistic regression, or something like that.
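    To make the dichotomous-outcome suggestion concrete, here is a hedged sketch in Python with simulated data (in Stata you would instead generate the indicator and use -logit- or -logistic-). The indicator is 1 where the price exceeds the sample mean, and a short Newton-Raphson loop fits the logistic regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
price = 6000 + 1500 * x + rng.normal(scale=1000, size=n)

# Step 1: dichotomize the outcome -- 1 if above the sample mean, else 0.
above = (price > price.mean()).astype(float)

# Step 2: fit a logistic regression of `above` on x via Newton-Raphson.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))     # predicted probabilities
    W = p * (1 - p)                         # IRLS weights
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (above - p))

# x raises price in the simulation, so its logit coefficient is positive.
assert beta[1] > 0
```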



    • #3
      Thank you! Your way of explaining things is very pleasant and clear.

      Do you think it is, however, useful to scale the variable like this so it reads as a percentage if one simultaneously examined another dependent variable that naturally comes as a percentage, like … the maximum speed compared with the manufacturer's information?
      I mean, do you think the estimates would then be more comparable?

      E.g.
      weight | .0003505

      A one-unit increase in weight increases the price by 0.0004 percentage points relative to its mean in the sample.
      A one unit increase in weight increases the veracity of speed information by … percentage points.

      Taking logarithms might do the trick, but I wonder what to do if one wanted to compare estimates between variables of different units/scales when taking logarithms is not an option because a variable takes negative values.

      Last edited by Rebecca Water; 27 Dec 2018, 22:48. Reason: typo



      • #4
        Do you think it is, however, useful to scale the variable like this so it reads as a percentage if one simultaneously examined another dependent variable that naturally comes as a percentage, like … the maximum speed compared with the manufacturer's information?
        I mean, do you think the estimates would then be more comparable?
        I think that this kind of usefulness is in the eye of the beholder. And it depends on the natural units of the original variables and how familiar your audience is with them. To me, if the outcome variable is price, and you give me a regression coefficient that tells me that a unit increase in some predictor is associated with a certain number of dollars difference in price, that seems crystal clear. If you tell me that the same unit increase in a predictor is associated with a difference in price equal to some fraction or percentage of the average price, I will not understand that unless I am already familiar with what the average price actually is. And even then, it forces me to go through the mental arithmetic of converting that percentage of the average price into actual dollars. So in that setting it seems perverse. On the other hand, if the outcome variable in question has no natural units because it is, say, a homebrew index that you have developed for the purpose of the study and nobody, except perhaps you, has any intuitions about what its values really mean, then putting things in terms that are relative to a mean value might indeed be clarifying.

        Taking logarithms might do the trick, but I wonder what to do if one wanted to compare estimates between variables of different units/scales when taking logarithms is not an option because a variable takes negative values.
        So I'm getting the impression that you are thinking about presenting elasticities. There are several things to be careful about here.

        1. Zero and negative values pretty much rule out the use of (semi-)elasticities. It's not just that you can't take logarithms. Forget logarithms and just think about it in terms of the outcome itself. If a unit change in a predictor is associated with increasing the outcome measure from -1 to +1, what is the percentage change in the outcome? Does that make any sense? Even worse, if it is associated with increasing the outcome measure from 0 to 1, what is the percentage change? It's infinite! Does that make any sense? So it's not that logarithms are a technical barrier to calculating elasticities: this property of the logarithm function arises precisely for the same reason that elasticities make no sense with zero and negative values.
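        The breakdown near zero can be seen in a few lines of arithmetic; this is plain Python, just to illustrate the point above:

```python
def pct_change(y0, y1):
    """Percentage change from y0 to y1 -- only meaningful when y0 > 0."""
    return 100.0 * (y1 - y0) / y0

print(pct_change(4.0, 5.0))   # a sensible +25.0 for a positive outcome
print(pct_change(-1.0, 1.0))  # -200.0: a *negative* change for an increase
# pct_change(0.0, 1.0) raises ZeroDivisionError: the change is "infinite".
```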

        2. There are modeling issues here. You do not have the freedom to say, well, I think (semi-)elasticities are easier to understand than coefficients here, so I'll model ln y vs x, resp. ln x. The logarithm is a nonlinear function, so if y is linearly related to x, then ln y is not linearly related to x, nor to ln x, and the logarithmic model is a mis-specification. So you really have to have a clear picture (literally, pictures, i.e. graphs, are very helpful in this regard) of what the relationship is. If the relationship between y and x is linear, then a given increase in x is associated with some regular change in y, whereas the percentage change in y will vary depending on the value of y you start from. For example, if the regression coefficient is 1.5, then increasing x by 1 is associated with an increase of 1.5 in y. If the initial y is 3, then it goes to 4.5, which is a 50% increase. But if the initial y is 6, then it goes to 7.5, which is a 25% increase. Similarly, if the relationship is really one where a given increase in x is associated with a constant percentage increase in y, then the proper specification of that model is ln y as a function of x. And in that case a model simply of y vs x can be quite misleading. So you need to clarify which model is a better specification of reality. As I said, graphing the relationship can be very helpful. Sometimes it's not: in particular, if the ranges of values of the x and y variables are pretty narrow, then y vs x, ln y vs x, and ln y vs ln x can all look pretty linear, and you may not be able to tell. If there is no theoretical basis for thinking that one of these models is closer to reality, then you are free to choose the one that is most convenient for you. But over a wide range of x and y values, it will usually be clear that one of these models is much closer to linear than the others (or perhaps that none of them is at all close to linear--in which case you have bigger problems to deal with).
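        The arithmetic in the example above can be spelled out in a few lines (plain Python, with the 1.5 coefficient taken from the text):

```python
import math

b = 1.5  # slope from the linear model y = a + b*x

# Linear model: a unit increase in x always adds 1.5 to y,
# but the *percentage* change depends on where you start.
for y0 in (3.0, 6.0):
    y1 = y0 + b
    print(y0, "->", y1, f"({100 * (y1 - y0) / y0:.0f}% increase)")  # 50%, then 25%

# Log-linear model ln(y) = a + g*x: a unit increase in x multiplies y
# by exp(g), i.e. a constant percentage change regardless of the start.
g = math.log(1.5)
for y0 in (3.0, 6.0):
    y1 = y0 * math.exp(g)
    print(y0, "->", y1, f"({100 * (y1 - y0) / y0:.0f}% increase)")  # 50% both times
```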
