Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • graphing splined regressions; continuity at knots

    I have used mkspline to estimate a regression, and want to graph the predicted value of the dependent variable as income varies. I did what seemed natural. Splined it in five sections -- using percentiles, so the knot points (inc1 through inc5) are at the 20th, 40th, etc. percentile of the sample's income distribution. I then tried " margins, at (inc1=(0 10000 20000 29835)) " which I believe gives average predicted value if everyone in the sample kept their other characteristics but income was fixed at the respective values (feel free to note any errors before I get to the question). The last value is the location of the first knot. Then I graphed this sucker using marginsplot. I then " margins, at (inc2=(29835 30000 etc.) and marginsplot for this segment.

    Issue 1: the two lines do not produce the same margin at the first knot point (or any of the others). Using the top value for inc1, the depvar is predicted to be .006829, using the bottom value for inc2 the prediction is .1146 -- the two values are too far apart to be roundoff error. Doesn't mkspline produce linear segments that are continuous at each knot?

    Issue 2 : If I solve issue 1, I'd like to combine the graphs into one graph where the vertical axis is the predicted depvar and the horizontal is income ranging across the splines. Is there an easy way to do that.

    Thanks for reading this far, and I'll appreciate any help.

  • #2
    Rich,

    You're on the right track, I think, but there may be a couple of problems with what you are doing. However, in order to comment more intelligently, it will help if you can give the regression command that you actually used before doing margins and marginsplot.

    Regards,
    Joe

    Comment


    • #3
      My theory is that margins does not realize that the values of inc1, inc2, etc, are not independent of each other. i.e. it doesn't realize that if inc1 = 10,000, then inc2 has to equal 10,000 too (or whatever it equals given the way you did the mkspline command.) So, it might have to be more something like

      at(inc1 = 10000 inc2 = 10000 inc3 = 10000) at (inc1 = 29000 inc2 = ...

      By way if analogy, suppose a model had inc and inc^2. If you did

      reg y inc c.inc#c.inc
      margins, at(inc = 1000 10000 50000)

      margins would get it right because it would know that inc and inc^2 were related to each other. But if instead you did

      gen incsq = inc * inc
      reg y inc incsq
      margins, at(inc = 1000 10000 50000)

      margins would get it wrong because it doesn't know that if inc = 1000 then incsq = 1,000,000 So, the margins command would instead need to be something like

      margins, at(inc = 1000 incsq = 100000) at(inc = 10000 incsq = 100000000) etc

      I may not have the syntax exactly right, but my basic theory is that if you set inc1 = something, you also have to tell margins what the other spline variables then equal. There isn't a convenient way to use factor variable notation in this case.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam

      Comment


      • #4
        Ok, here is a simple and replicable example of what I mean.

        Code:
        use "http://www3.nd.edu/~rwilliam/statafiles/blwh.dta", clear
        mkspline educ1 12 educ2 = educ, marginal
        tab2 educ1 educ2
        reg  income educ1 educ2 jobexp i.black
        * Margins doesn't know how educ1 and educ2 are related
        quietly margins, at(educ1 = (8 12 16))
        marginsplot
        * Ergo, we fill in the corresponding values for educ2
        quietly margins, at(educ1 = 8 educ2 = 0) at(educ1 = 12 educ2 = 0) at(educ1 = 16 educ2 = 4)
        marginsplot
        If I am right, I think this will solve both of your problems, i.e. you will get the graph you want too.
        Last edited by Richard Williams; 30 Jun 2014, 20:41.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        Stata Version: 17.0 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam

        Comment


        • #5
          Rich,

          You may also want to check out marginscontplot (from Stata Journal), aka mcp, which specializes in plotting margins for continuous covariates. Here is a way that I have used it in the past, applied to the data Richard presents above:

          Code:
          use "http://www3.nd.edu/~rwilliam/statafiles/blwh.dta", clear
          mkspline educ1 12 educ2 = educ, marginal
          reg income educ1 educ2 jobexp i.black
          
          tab educ
          gen educrange=_n in 1/21    // Make a dummy set of education variables for mcp
          mkspline er1 12 er2 = educrange
          mcp educ (educ1 educ2), var1(educrange (er1 er2)) margopts(predict(xb))
          Note that the margin option to mkspline is significant. Both using it and not using it are interesting; it just depends on what you want to show. I don't know if it is crucial to Richard's demonstration, but it's worth understanding it and playing around with it anyway.

          Regards,
          Joe

          P.S. The reason I wanted to see your regression model is because I was wondering whether you had any interactions between the splined variable and some other covariate. However, that is only necessary if you are estimating different lines for each value of a categorical covariate and you want those lines to also be splined. The above refers to the simple case where you are just looking to estimate one "line" (consisting of several segments).

          Comment


          • #6
            There is an example that uses margins with splines and interactions here: http://maartenbuis.nl/wp/inter_quadr/inter_quadr.html
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------

            Comment


            • #7
              Thanks Joe and Richard. Richard, you have it. Note that the short and long documentation for this uses the equivalent of educ2=educ in examples, but doesn't really explain what this is doing, so perhaps documentation needs improvement.

              I have more knot points than your example, and although I think I adapted things appropriately and get semi-plausible estimates, I know that that is no guarantee I did it right. So, if you have nothing better to do with your time, glance over this. I am also having problems with graphing -- it wants to label the x axis with 6 variable values for each plotted point reflecting each spline, and does so in an unreadable and not to scale way. Maybe I can solve that one myself, but if you have any thoughts they would be appreciated.

              Here is what I did (if curious, the depvar is charitable donations/family income and the indep var we are splining is family income. It is commonly believed that this is a U-shaped curve, but we seem to be showing otherwise with far superior data. The funny numbers for knot points represent quintiles of the sample income distribution).

              mkspline incp1 29835.05 incp2 40425.994 incp3 51286.80078125 incp4 77691.59375 incp5 119319.865625 incp6=totincfam
              regress dony incp1-incp6 $control if Atot>0, vce(cluster id68)
              margins, at (incp1=10000 incp2=0 incp3=0 incp4=0 incp5=0 incp6=0) at (incp1=29835 incp2=0 incp3=0 incp4=0 incp5=0 incp6=0) at (incp2=29836 incp1=29835.05 incp3=0 incp4=0 incp5=0 >incp6=0) at (incp2=40425 incp1= 29835.05 incp3=0 incp4=0 incp5=0 incp6=0) at (incp3=40426 incp1=29835.05 incp2= 40425.994 incp4=0 incp5=0 incp6=0) at (incp3=51286 incp1=29835.05 >incp2=40425.994 incp4=0 incp5=0 incp6=0) at (incp4=51287 incp1=29835.05 incp2=40425.994 incp3=51286.80078125 incp5=0 incp6=0) at(incp4=77691 incp1=29835.05 incp2=40425.994 >incp3=51286.80078125 incp5=0 incp6=0) at(incp5=77692 incp1=29835.05 incp2=40425.994 incp3=51286.80078125 incp4=77691.59375 incp6=0) at(incp5=119319 incp1=29835.05 >incp2=40425.994 incp3=51286.80078125 incp4=77691.59375 incp6=0) at(incp6=119320 incp1=29835.05 incp2=40425.994 incp3=51286.80078125 incp4=77691.59375 incp5=119319.865625) >at (incp6=150000 incp1=29835.05 incp2=40425.994 incp3 =51286.80078125 incp4=77691.59375 incp5=119319.865625)
              marginsplot

              Note I did not use the marginal option for mkspline. Should work anyway, I believe?

              Comment


              • #8
                All I have is my iPad now so I can't test Joe's code, but I agree that mcp is a wonderful program. It looks like it might be less tedious and error-prone than writing it out the way I did. However you approach it, I would suggest starting off real simple, e.g. Drop all the other variables, and confirm (by hand if necessary) that everything is looking right.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                Stata Version: 17.0 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam

                Comment


                • #9
                  Royston's article on the mcp command is at http://www.stata-journal.com/article...article=gr0056 . Section 6.2, which I had totally forgotten about, is called "Analysis with spline models." Sooner or later the article will be free, but in the meantime it may be worth the $8.75 to purchase it. I suspect Rich S's code could be much simpler, but without having the data it is hard to be sure what is right. I think Joe's code may be pretty easy to adapt.
                  -------------------------------------------
                  Richard Williams, Notre Dame Dept of Sociology
                  Stata Version: 17.0 MP (2 processor)

                  EMAIL: [email protected]
                  WWW: https://www3.nd.edu/~rwilliam

                  Comment


                  • #10
                    Joe's code doesn't work right but that is only because the two mkspline commands are inconsistent. You either have to use the marginal optiohn in both or not use it. This works:

                    Code:
                    * Joe Canner Code tweaked
                    use "http://www3.nd.edu/~rwilliam/statafiles/blwh.dta", clear
                    mkspline educ1 12 educ2 = educ
                    reg income educ1 educ2 jobexp i.black
                    
                    tab educ
                    gen educrange=_n in 1/21    // Make a dummy set of education variables for mcp
                    mkspline er1 12 er2 = educrange
                    mcp educ (educ1 educ2), var1(educrange (er1 er2)) margopts(predict(xb))
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    Stata Version: 17.0 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

                    Comment


                    • #11
                      Following is a hybrid of Joe's code and Royston's. I fear I have out-cuted myself here. But, if I am right, you just need to change the lines immediately after a line that starts ***. For God's sake, make sure you believe the results though! I have inserted lots of comments but if unclear let me know. Part of the complexity stems from the fact that I am letting Stata compute the quintiles.

                      Code:
                      *** Read in the correct data set
                      webuse nhanes2f, clear
                      * Compute splines. In this case I split weight up into quintiles
                      * but you can do it however you want
                      *** Substitute your spline var for weight
                      mkspline splinevar 5 = weight, pctile di
                      * Since I didn't explicitly specify the knots, I will store their
                      * values as scalars.
                      matrix knots = r(knots)
                      scalar knot1 = knots[1,1]
                      scalar knot2 = knots[1,2]
                      scalar knot3 = knots[1,3]
                      scalar knot4 = knots[1,4]
                      * Now I will create a range variable for the var used to create splines.
                      * In this case I am dividing up into 20 equally spaced intervals but I
                      * could do it other ways
                      *** Substitute the correct var for weight
                      sum weight
                      range vrange r(min) r(max) 20
                      * Now take the range variable, and create spline vars using the same
                      * knots as before
                      mkspline vr1 `=knot1' vr2 `=knot2' vr3 `=knot3' vr4 `=knot4' vr5 = vrange
                      * Run the model! 
                      *** Replace health with your dv. Change other indep vars as appropriate. Leave the splinevars.
                      reg health splinevar1-splinevar5 i.female i.black
                      * Finally, plot the results.
                      *** Substitute your var for weight
                      mcp weight (splinevar1-splinevar5), var1( vrange (vr1-vr5)) show
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      Stata Version: 17.0 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam

                      Comment


                      • #12
                        In my example above, for range you could specify a value as high as 69. More than that and mcp will generate an error. Basically, the value you set on range is the number of points at which mcp will compute the predicted value. The show option shows you the command mcp generated.
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology
                        Stata Version: 17.0 MP (2 processor)

                        EMAIL: [email protected]
                        WWW: https://www3.nd.edu/~rwilliam

                        Comment


                        • #13
                          One other thought: the way I used range might not be good if you have extreme outliers. e.g. most people make $100K or less but Bill Gates and Warren Buffett also happened to be in your sample. Royston's paper talks about different ways of specifying the range over which the calculations will be done. If you have a handful of super-rich people, you might tweak the above to something like

                          sum weight, detail
                          range vrange r(min) r(p99) 20

                          Or, you could do something like

                          range vrange 1000 50000 20

                          i.e. you can specify the high and low values for the range and how many intervals to divide them up into.
                          Last edited by Richard Williams; 01 Jul 2014, 23:04.
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          Stata Version: 17.0 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam

                          Comment


                          • #14
                            Going back to Rich S's own code, see if this does what you want:

                            Code:
                            mkspline incp1 29835.05 incp2 40425.994 incp3 51286.80078125 incp4 77691.59375 ///
                                 incp5 119319.865625 incp6=totincfam
                            regress dony incp1-incp6 $control if Atot>0, vce(cluster id68)
                            range vrange 0 200000 50
                            mkspline vr1 29835.05 vr2 40425.994 vr3 51286.80078125 vr4 77691.59375 vr5 119319.865625 vr6=totincfam
                            mcp totincfam (incp1-incp6), var1(vrange (vr1-vr6)) show
                            If it is way off then I am probably doing this wrong! But it is hard to test without the actual data.
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            Stata Version: 17.0 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam

                            Comment


                            • #15
                              Richard,

                              Thanks for fixing my code. I actually had the problem you mention with the max of 69 when I did an analysis of age. My age range was from 18 to 110, so I had to estimate every other age (i.e., only odds or only evens) in order to keep the number of points below 70. Since your example only had 21 possible points, I didn't think to mention that limitation.

                              Joe

                              Comment

                              Working...
                              X