Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How can I calculate a confidence interval around a linear spline that it interacted with another variable

    I have a large dataset (50m rows) in Stata. One of my independent variables is age, and I'm using linear splines to try to capture the complex pattern between age and my dependant variable. I also have some interactions with age, such as household type.

    My question is: how can I calculate the confidence interval around the age spline for each of the interactions with household type so that I can determine if the dependant is stat sig different by household type at a given age?

    To make the problem replicable, I've restated it below using the auto dataset:

    Code:
    sysuse auto
    mkspline weight 3 = weight, pctile
    reg price c.weight?#i.foreign mpg trunk, base
    In this case, the question would be - does weight have a different effect on price depending if it is foreign or not?

    I have tried using lincom as below to see if the CIs overlap (they do, so I conclude that weight is not stat sig different by foreign at that weight), but I don't know if it's the correct approach or how to manually create the combination of CIs that lincom does.

    Code:
    list weight* if weight == 2830
    lincom _b[0.foreign#c.weight1]*2640 + _b[0.foreign#c.weight2]*190 + _b[0.foreign#c.weight3]*0
    lincom _b[1.foreign#c.weight1]*2640 + _b[1.foreign#c.weight2]*190 + _b[1.foreign#c.weight3]*0
    Can anyone help please?
    Last edited by Rob Shaw; 15 Oct 2019, 02:34. Reason: added tags

  • #2
    Remember, overlapping confidence intervals does not mean that the difference is not statistically significant. The problem is that you hypothesis is on the difference between two random numbers. So the variance of that difference is the variance of the first parameter + the variance of the second parameter - 2*covariance between the two parameters. The confidence intervals only include information on the variances, but ignore the covariance.

    To get the graph you ask for I would use:
    Code:
    set scheme s1color
    sysuse auto, clear
    mkspline weight 2 = weight, pctile
    reg price c.weight?#i.foreign mpg trunk, base
    margins, over(weight foreign) at(mpg=20 trunk=15)
    marginsplot , xdim(weight) ciopts(recast(rarea) astyle(ci) acolor(%50))
    Click image for larger version

Name:	Graph.png
Views:	1
Size:	89.0 KB
ID:	1520469


    However, to get the graph that will answer your question you should use (I use the if condition to ensure that I don't use extrapolation when looking at the difference between foreign and domestic cars, maybe you want to extrapolate maybe you don't, that is up to you):

    Code:
    sum weight if foreign == 0 & e(sample), meanonly
    local max = r(max)
    local min=r(min)
    sum weight if foreign == 1 & e(sample), meanonly
    local max = min(`max',r(max))
    local min = max(`min',r(min))
    margins if inrange(weight,`min', `max'), over(weight) dydx(foreign) at(mpg=20 trunk=15)
    marginsplot , ciopts(recast(rarea) astyle(ci) ) yline(0) ylab(0(2500)10000, format(%9.0gc))
    Click image for larger version

Name:	Graph.png
Views:	1
Size:	73.6 KB
ID:	1520470

    In this case we see that the two curves in the first graph overlap, but the difference is still clearly statistically significant (the confidence interval of the difference does not include 0).
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      Cross-posted at https://stackoverflow.com/questions/...hat-it-interac It was a better idea to post here, but our explicit request about cross-posting is that you tell us about it.

      Comment


      • #4
        Nick: apologies for the cross-posting

        Maarten: Thanks this is really helpful. Do you know if there is somewhere that specifies how margins actually does the dydx calculation, and specifically if it uses anything other than the regress output and the variable values? The Stata method and formulas for margins is not terrible helpful.

        One of the reasons I ask it that margins takes a very long time to run - for example, it's been running for 6 hours on my dataset already and hasn't finished.

        Many thanks
        Rob

        Comment


        • #5
          We have nothing against cross-posting, we just want you to tell us and give the link. That way we can quickly see if someone else has already given the answer we want to type at the other place. That way we can avoid duplication of effort.

          By default margins computes average predictions, marginal effects, or differences. This means that for each person in the dataset it computes the chosen statistic and than computes the average statistic over the observations. In very large samples that can get slow, in particular the computation of the confidence interval. However, to get the graph we want we don't want the average difference, instead we want the difference at fixed values of other explanatory variables. So we can speed up the computation by restricting it to only one observation per distinct value of weight and foreign. In the example dataset this won't make a noticeable difference, as that dataset is so small, but in your dataset that may do the trick. Here is how I would change my example:

          Code:
          set scheme s1color
          sysuse auto, clear
          mkspline weight 2 = weight, pctile
          reg price c.weight?#i.foreign mpg trunk, base
          
          sum weight if foreign == 0 & e(sample), meanonly
          local max = r(max)
          local min=r(min)
          sum weight if foreign == 1 & e(sample), meanonly
          local max = min(`max',r(max))
          local min = max(`min',r(min))
          
          gen byte touse = e(sample)                                       // new
          bys weight foreign touse : gen byte mark = (_n==1) if touse == 1 // new
          
          margins if inrange(weight,`min', `max') & mark == 1 ,           /// changed
               over(weight) dydx(foreign) at(mpg=20 trunk=15)
              
          marginsplot , ciopts(recast(rarea) astyle(ci) ) yline(0) ///
               ylab(0(2500)10000, format(%9.0gc))
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          Comment


          • #6
            Maarten: thanks for that code.

            I tried it on my full data set but after 72 hours it still hadn't completed.

            I was wondering if it would be possible to calculate it manually using the covariance matrix from the regression.

            My data has 22 spline variables (representing age bands) and interacts these with 20 other variables (a mix of continuous and dummy factor variables). My regression command is

            Code:
            reg Yvar (i.sex c.IMD19score i.hholdnum 1.changehhtype 1.New_in_last_yr  1.deathinhholdlastyr 1.moved3times_2yrs 1.moved4plustimes_2yrs c.GP_patientexp1718 c.AEdrivetime)#c.( age0_1- age100_105), base vsquish noomit
            where
            Code:
            age0_1- age100_105
            are my linear splines.

            I've manually estimated the marginal effect for a specific variable over age by multiplying the coefficients by the means of the variables for a single age to get the fitted value at that age, then adding the age splines multiplied by the coefficient and the change size. So for AEdrivetime, the effect of living an extra 10 minutes further than the average for a given age is

            Code:
            fittedvalue + sum(age0_1- age100_105)*coefficients[AEdrivetime#(age0_1- age100_105)]*10
            But what I also want to get at is the confidence interval around that. I don't think it is

            Code:
            fittedvalue + sum(age0_1- age100_105)*coefficients[AEdrivetime#(age0_1- age100_105)]*10 +sum(age0_1- age100_105)*1.96*se[AEdrivetime#(age0_1- age100_105)]*10
            (for the positive CI)

            Any ideas much appreciated.

            Rob

            Comment

            Working...
            X