Correct Functional Form Selection

James Hodkinson

Join Date: Apr 2019

Posts: 8
#1


27 Apr 2019, 06:12

I have been asked to verify the effect of water and sanitation on the mortality of children under the age of 5 and quantify whether providing water services or sanitation services has a larger effect on child mortality.

The variables are:

mortality rate of children under the age of 5 (per 1000 live births) - INFMORT
GDP per Capita - GDPPC
% with access to basic water services - WATER
% with access to basic sanitation services - SANIT

My starting model was INFMORT = b0 + b1 (GDPPC) + b2 (WATER) + b3 (SANIT)

I was wondering whether anyone had any suggestions as to the correct functional form to use, as all the variations I have tried have delivered unexpected results.
Nick Cox

Join Date: Mar 2014

Posts: 35727
#2

27 Apr 2019, 06:19

Usually GDP per head works better logged.

There is sometimes a case for looking at bounded percents on logit scale if 0 and 100 are not data values, or possibly on folded cube root scale if either 0 or 100 is a data value. I would be impressed, however, at any audience or readership that knew what folded cube roots were.

More at http://fmwww.bc.edu/repec/bocode/t/transint.html (in Stata, ssc install transint installs that as a help file).

https://stats.stackexchange.com/ques.../195305#195305
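
As a rough sketch of the two scales mentioned above, the following generates a logit and a folded cube root of a percent variable. The four WATER values and the names p, logit_water and fcr_water are illustrative assumptions, not part of the original data or advice. The folded cube root is taken here as p^(1/3) - (1 - p)^(1/3) for a proportion p.

```stata
* Sketch only: two scales for a bounded percent such as WATER.
* The four values and the new variable names are illustrative assumptions.
clear
input float WATER
36.6
50
93.5
100
end

* work with the proportion rather than the percent
gen double p = WATER / 100

* logit: ln(p / (1 - p)); returned as missing when p is 0 or 1
gen double logit_water = logit(p)

* folded cube root: p^(1/3) - (1 - p)^(1/3); defined at both 0 and 1
gen double fcr_water = p^(1/3) - (1 - p)^(1/3)

list, sep(0)
```

The observation at 100 gets a missing logit but a folded cube root of 1, which is why the folded cube root may be preferable when 0 or 100 occurs in the data.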
James Hodkinson

Join Date: Apr 2019

Posts: 8
#3

27 Apr 2019, 06:32

Thanks for the prompt reply,

I thought that would be the correct form. I estimated the model

INFMORT = b0 + b1 ln(GDPPC) + b2 (WATER) + b3 (SANIT)

and ran regress. However, I have a very large p-value for WATER (0.948).

I am not sure why this is. I have checked for multicollinearity using the vif command and found nothing.

I apologise that I cannot include the data, as I am on a university computer and cannot use dataex, though I have included the results of my regression.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       SANIT |      3.74    0.267288
       WATER |      3.16    0.316532
       GDPPC |      1.71    0.583561
-------------+----------------------
    Mean VIF |      2.87

Last edited by James Hodkinson; 27 Apr 2019, 06:35.
Nick Cox

Join Date: Mar 2014

Posts: 35727
#4

27 Apr 2019, 11:46

I don't see why being on a university computer should be an issue unless you are using an out-of-date version of Stata, which you are asked to explain.
https://www.statalist.org/forums/help#version

You should still be able to do this

Code:

list lgdppc WATER SANIT INFMORT, sep(0)

and copy and paste the result in between CODE delimiters

Code:

like this

Which WATER value is (near) zero?

James Hodkinson

Join Date: Apr 2019
Posts: 8
#5

28 Apr 2019, 04:31

Thank you very much for taking the time to reply again.
I have included the data below.
I believe that the zero value for WATER comes from observation 39 (Uzbekistan) as the value is missing.
I cannot remove or add any observations though.

Thanks again

Code:

     +--------------------------------------------+
     |   lgdppc       WATER       SANIT   INFMORT |
     |--------------------------------------------|
  1. | 9.526954   93.466427   87.487686      25.5 |
  2. | 9.857512     99.6273   94.839881      11.6 |
  3. | 9.011394   98.923652   91.582767        14 |
  4. | 10.68813   99.968702         100       3.8 |
  5. | 8.049608   97.327142   46.924593      36.3 |
  6. | 8.784395   92.881626   52.614027      38.2 |
  7. | 9.593288   97.497943   86.148352      15.7 |
  8. | 10.66856        98.9        98.5       5.1 |
  9. |  10.0229         100    99.88821       8.4 |
 10. | 9.515609   95.818029   75.037279      10.7 |
 11. | 8.083826   73.057416   29.931838      95.1 |
 12. | 10.72456         100    99.59723       4.3 |
 13. | 9.500883   94.477074   82.696982      31.5 |
 14. | 8.967651   93.013071   91.125074      15.5 |
 15. | 8.894631   93.597697    67.36292      29.5 |
 16. | 8.657661    87.55914   44.151123      45.2 |
 17. |  9.24645    89.52403   67.886608      27.3 |
 18. | 7.950024   58.456777    29.84468        51 |
 19. | 8.082588   87.268639   96.586078      22.3 |
 20. | 9.499475   92.258192   95.359365       8.4 |
 21. |  10.1267   96.433806   99.574007       8.2 |
 22. | 8.189137   69.607679   44.620624      83.9 |
 23. | 9.206757   78.787209   33.836767        48 |
 24. | 8.454403   88.545556   58.251404        81 |
 25. | 9.936646   95.000225    76.86704      16.9 |
 26. | 8.260109   36.596332    18.60417      56.2 |
 27. | 9.064075   98.885342   91.222225      20.6 |
 28. | 9.930371         100   81.812719       9.2 |
 29. | 10.83415   99.992893         100      13.3 |
 30. | 7.737988   75.188632   48.358136      49.5 |
 31. | 9.427493   84.697265   73.128824      44.1 |
 32. | 10.38249    99.94341    99.90436       3.4 |
 33. | 8.364129   58.931449   34.586105      67.1 |
 34. | 7.878757    74.13973   95.492142      44.5 |
 35. | 7.820422    50.14528   23.534066      58.8 |
 36. | 7.207394   62.817638   13.948482        78 |
 37. | 7.418151   38.921466   19.150955      55.9 |
 38. | 10.87666   99.200435   99.969551       6.6 |
 39. | 8.648263           .         100      25.8 |
 40. | 8.642488   91.191766   78.237448        22 |
     +--------------------------------------------+


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17713
#6

28 Apr 2019, 04:43

James:
please use -dataex- to share an example/excerpt of your data (see the FAQ). Thanks.

Kind regards,
Carlo
(Stata 19.0)

Nick Cox

Join Date: Mar 2014
Posts: 35727
#7

28 Apr 2019, 05:37

@Carlo Lazzaro: James claims that he can't use dataex (#3, #4). Here is a dataex version anyway.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(lgdppc WATER SANIT INFMORT)
9.526954  93.46643  87.48769 25.5
9.857512   99.6273  94.83988 11.6
9.011394  98.92365  91.58276   14
10.68813   99.9687       100  3.8
8.049608  97.32714  46.92459 36.3
8.784395  92.88162  52.61403 38.2
9.593288  97.49794  86.14835 15.7
10.66856      98.9      98.5  5.1
 10.0229       100  99.88821  8.4
9.515609  95.81803  75.03728 10.7
8.083826  73.05742 29.931837 95.1
10.72456       100  99.59723  4.3
9.500883  94.47707  82.69698 31.5
8.967651  93.01307  91.12508 15.5
8.894631  93.59769  67.36292 29.5
8.657661  87.55914  44.15112 45.2
 9.24645  89.52403 67.886604 27.3
7.950024  58.45678  29.84468   51
8.082588  87.26864  96.58607 22.3
9.499475  92.25819  95.35937  8.4
 10.1267  96.43381  99.57401  8.2
8.189137  69.60768  44.62062 83.9
9.206757  78.78721 33.836765   48
8.454403  88.54556   58.2514   81
9.936646  95.00022  76.86704 16.9
8.260109 36.596333  18.60417 56.2
9.064075  98.88535  91.22222 20.6
9.930371       100  81.81272  9.2
10.83415  99.99289       100 13.3
7.737988  75.18863  48.35814 49.5
9.427493  84.69727  73.12882 44.1
10.38249  99.94341  99.90436  3.4
8.364129  58.93145 34.586105 67.1
7.878757  74.13973  95.49214 44.5
7.820422  50.14528 23.534065 58.8
7.207394  62.81764 13.948482   78
7.418151  38.92147 19.150955 55.9
10.87666  99.20043  99.96955  6.6
8.648263         .       100 25.8
8.642488  91.19176  78.23745   22
end

A missing value wouldn't be plotted at all.

WATER and SANIT are certainly fighting for market share in your regression.

Install favplots from SSC and run it after each regression you try.

Make sure that you look at a scatter plot matrix.

Also install multqplot from the Stata Journal. See https://www.statalist.org/forums/for...panel-data-set for a recent riff on its use.

Code:

. favplots

. graph matrix *, half

. multqplot *

I would feel a little more comfortable with say a cubed scale for WATER and SANIT -- odd though that may seem. Stretching the right-hand tail seems to have a process rationale too. Arm-waving a little, to get from 98 to 99 or from 99 to 100 is a bigger deal than say to get from 50 to 51 (on a percent scale).
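
To put rough numbers on that arm-waving, here is a small sketch comparing a one-point gain at two places on the cubed proportion scale. The percents 50, 51, 98 and 99 are the ones mentioned above; nothing here uses the data.

```stata
* Gain on the cubed proportion scale from 50% to 51%
display 0.51^3 - 0.50^3

* Gain on the cubed proportion scale from 98% to 99%
display 0.99^3 - 0.98^3

* Ratio of the two gains
display (0.99^3 - 0.98^3) / (0.51^3 - 0.50^3)
```

The gains are about 0.0077 and 0.0291, so on the cube scale a step from 98 to 99 counts for roughly 3.8 times as much as a step from 50 to 51.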

Code:

. gen water2 = (WATER/100)^3
(1 missing value generated)

. gen sanit2 = (SANIT/100)^3

. regress I lgdppc water2 sanit2

      Source |       SS           df       MS      Number of obs   =        39
-------------+----------------------------------   F(3, 35)        =     40.14
       Model |  18905.8406         3  6301.94687   Prob > F        =    0.0000
    Residual |   5495.5286        35  157.015103   R-squared       =    0.7748
-------------+----------------------------------   Adj R-squared   =    0.7555
       Total |  24401.3692        38  642.141295   Root MSE        =    12.531

------------------------------------------------------------------------------
     INFMORT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lgdppc |  -4.867372   3.612887    -1.35   0.187    -12.20192    2.467178
      water2 |  -28.51761   11.96822    -2.38   0.023    -52.81439   -4.220826
      sanit2 |  -27.76032   8.795875    -3.16   0.003    -45.61689   -9.903744
       _cons |   110.5291   27.23144     4.06   0.000     55.24631    165.8119
------------------------------------------------------------------------------

Here are added variable plots showing identifiers for the regression reported earlier and for log infant mortality as a function of log GDP pc and water and sanit(ation) on a cube scale. The story is loosely the same, but I would here find small gains from transforming all the variables.

Note that 11, 22, 24 are hard to explain on your regression, but not so much on that suggested here.

To get the added variable plots, the recipe is something like

Code:

gen id = _n
favplots, ms(none) mla(id) mlabpos(0)

You need to define better variable labels for more attractive graphs -- and more intelligible ones too!

[Image: favplots.png]

[Image: favplots2.png]


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17713
#8

28 Apr 2019, 05:42

Sorry, my mistake.

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35727
#9

28 Apr 2019, 05:52

@Carlo Lazzaro: No; your advice really is good. For example, on my University system we set up Stata so that users can install extras (including my commands, surprise!). Other universities may work differently.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17713
#10

28 Apr 2019, 05:57

Nick:
interesting, thanks. (I would be surprised if the IT managers at Durham University, UK do not allow teachers and students to install the extras you developed!)

Kind regards,
Carlo
(Stata 19.0)
James Hodkinson

Join Date: Apr 2019

Posts: 8
#11

28 Apr 2019, 08:27

Thank you, that is very informative.

From my understanding of functional form, when adding polynomial terms you must leave the initial, squared and cubed terms in the model.

For example, using WATER as the sole independent variable.

Code:

gen water2 = (WATER/100)^2

gen water3 = (WATER/100)^3

regress INFMORT WATER water2 water3

I am also slightly unsure of why WATER must be divided by 100.
Nick Cox

Join Date: Mar 2014

Posts: 35727
#12

28 Apr 2019, 09:27

Several misunderstandings here:

1. I was not trying to fit a polynomial in WATER and see no point in doing so. The point of a cube transform of a predictor here is to improve the symmetry of its distribution (a secondary goal) and to improve the linearity of prediction (a primary goal). The test is how well it works, and the added variable plots are my evidence.

2. Dividing by 100 is a small device to keep numbers small (e.g. 100 cubed is 1 million), but it has no effect on any figures of merit (R-square, t statistics, P-values).

3. Using a cube, rather than say a square, is a little arbitrary and comes partly from experience. It seems that you have not read the links given in #2, where this kind of issue is discussed. But my choice of cube is tailored to your data. Compare the choices of leaving a variable as it comes, of squaring it and of cubing it (one need not stop there, but I did). For a variable with upper bound 100%, it is easier to think about transformations of the proportion or fraction, as any power of 1 is 1. Squaring a variable that varies only over a factor of 2 is a very mild transformation, while cubing is stronger.
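
Point 2 can be checked directly. The sketch below is only an illustration: it uses the WATER and INFMORT values from the first ten rows of the dataex listing posted earlier in this thread, so its estimates will not match the full regression, and the names w_raw, w_scaled and the scalars are assumptions made for the example.

```stata
* Sketch only: dividing by 100 before cubing rescales the coefficient
* but leaves R-squared and the t statistic unchanged.
* Data: WATER and INFMORT from the first ten rows posted in this thread.
clear
input float(WATER INFMORT)
93.46643 25.5
99.6273  11.6
98.92365 14
99.9687  3.8
97.32714 36.3
92.88162 38.2
97.49794 15.7
98.9     5.1
100      8.4
95.81803 10.7
end

gen double w_raw    = WATER^3
gen double w_scaled = (WATER/100)^3

* regression on the cube of the percent
regress INFMORT w_raw
scalar r2_raw = e(r2)
scalar t_raw  = _b[w_raw] / _se[w_raw]

* regression on the cube of the proportion
regress INFMORT w_scaled
scalar r2_scaled = e(r2)
scalar t_scaled  = _b[w_scaled] / _se[w_scaled]

display "R-squared:   " r2_raw " and " r2_scaled
display "t statistic: " t_raw " and " t_scaled
```

Only the coefficient and its standard error change, each by the same factor of one million, so their ratio and the fit are the same up to rounding.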

Here is a graph: The graph is drawn for a range corresponding to the observed range of WATER. SANIT raises similar issues.

Code:

twoway function identity=x/100, ra(WATER) lc(red) || function square=(x/100)^2, ra(WATER) lc(blue) || function cube=(x/100)^3, lc(black) ra(WATER) legend(col(1) pos(5) ring(0))

So, using a cube is a device for a predictor that is a proportion bounded by 1 (so logit doesn't apply without some fudge) that works on the awkward asymmetry (skewness) and more importantly the nonlinearity evident in the original data.

A point made only rarely in discussion, but one quite often useful in my own work, is that squaring or cubing proportions helps whenever left-skewness is present.

Other way round, if you try this for your data,

Code:

forval j = 1/3 {
    gen w`j' = (WATER/100)^`j'
}

corr w?
(obs=39)

             |       w1       w2       w3
-------------+---------------------------
          w1 |   1.0000
          w2 |   0.9929   1.0000
          w3 |   0.9765   0.9951   1.0000

graph matrix w?, half

you will see that given the small range the three powers (1, 2, 3) of WATER are so highly correlated that including them all in a regression would be of dubious merit.


Last edited by Nick Cox; 28 Apr 2019, 09:37.