How can I calculate a confidence interval around a linear spline that it interacted with another variable

Rob Shaw

Join Date: May 2015

Posts: 32
#1

How can I calculate a confidence interval around a linear spline that it interacted with another variable

15 Oct 2019, 02:03

I have a large dataset (50m rows) in Stata. One of my independent variables is age, and I'm using linear splines to try to capture the complex pattern between age and my dependant variable. I also have some interactions with age, such as household type.

My question is: how can I calculate the confidence interval around the age spline for each of the interactions with household type so that I can determine if the dependant is stat sig different by household type at a given age?

To make the problem replicable, I've restated it below using the auto dataset:

Code:

sysuse auto mkspline weight 3 = weight, pctile reg price c.weight?#i.foreign mpg trunk, base

In this case, the question would be - does weight have a different effect on price depending if it is foreign or not?

I have tried using lincom as below to see if the CIs overlap (they do, so I conclude that weight is not stat sig different by foreign at that weight), but I don't know if it's the correct approach or how to manually create the combination of CIs that lincom does.

Code:

list weight* if weight == 2830 lincom _b[0.foreign#c.weight1]*2640 + _b[0.foreign#c.weight2]*190 + _b[0.foreign#c.weight3]*0 lincom _b[1.foreign#c.weight1]*2640 + _b[1.foreign#c.weight2]*190 + _b[1.foreign#c.weight3]*0

Can anyone help please?

Last edited by Rob Shaw; 15 Oct 2019, 02:34. Reason: added tags
Tags: confidence intervals, mkspline
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#2

15 Oct 2019, 03:30

Remember, overlapping confidence intervals does not mean that the difference is not statistically significant. The problem is that you hypothesis is on the difference between two random numbers. So the variance of that difference is the variance of the first parameter + the variance of the second parameter - 2*covariance between the two parameters. The confidence intervals only include information on the variances, but ignore the covariance.

To get the graph you ask for I would use:

Code:

set scheme s1color sysuse auto, clear mkspline weight 2 = weight, pctile reg price c.weight?#i.foreign mpg trunk, base margins, over(weight foreign) at(mpg=20 trunk=15) marginsplot , xdim(weight) ciopts(recast(rarea) astyle(ci) acolor(%50))

However, to get the graph that will answer your question you should use (I use the if condition to ensure that I don't use extrapolation when looking at the difference between foreign and domestic cars, maybe you want to extrapolate maybe you don't, that is up to you):

Code:

sum weight if foreign == 0 & e(sample), meanonly local max = r(max) local min=r(min) sum weight if foreign == 1 & e(sample), meanonly local max = min(`max',r(max)) local min = max(`min',r(min)) margins if inrange(weight,`min', `max'), over(weight) dydx(foreign) at(mpg=20 trunk=15) marginsplot , ciopts(recast(rarea) astyle(ci) ) yline(0) ylab(0(2500)10000, format(%9.0gc))

In this case we see that the two curves in the first graph overlap, but the difference is still clearly statistically significant (the confidence interval of the difference does not include 0).

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

15 Oct 2019, 03:41

Cross-posted at https://stackoverflow.com/questions/...hat-it-interac It was a better idea to post here, but our explicit request about cross-posting is that you tell us about it.
Comment
Rob Shaw

Join Date: May 2015

Posts: 32
#4

15 Oct 2019, 08:48

Nick: apologies for the cross-posting

Maarten: Thanks this is really helpful. Do you know if there is somewhere that specifies how margins actually does the dydx calculation, and specifically if it uses anything other than the regress output and the variable values? The Stata method and formulas for margins is not terrible helpful.

One of the reasons I ask it that margins takes a very long time to run - for example, it's been running for 6 hours on my dataset already and hasn't finished.

Many thanks
Rob
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#5

16 Oct 2019, 01:54

We have nothing against cross-posting, we just want you to tell us and give the link. That way we can quickly see if someone else has already given the answer we want to type at the other place. That way we can avoid duplication of effort.

By default margins computes average predictions, marginal effects, or differences. This means that for each person in the dataset it computes the chosen statistic and than computes the average statistic over the observations. In very large samples that can get slow, in particular the computation of the confidence interval. However, to get the graph we want we don't want the average difference, instead we want the difference at fixed values of other explanatory variables. So we can speed up the computation by restricting it to only one observation per distinct value of weight and foreign. In the example dataset this won't make a noticeable difference, as that dataset is so small, but in your dataset that may do the trick. Here is how I would change my example:

Code:

set scheme s1color sysuse auto, clear mkspline weight 2 = weight, pctile reg price c.weight?#i.foreign mpg trunk, base sum weight if foreign == 0 & e(sample), meanonly local max = r(max) local min=r(min) sum weight if foreign == 1 & e(sample), meanonly local max = min(`max',r(max)) local min = max(`min',r(min)) gen byte touse = e(sample) // new bys weight foreign touse : gen byte mark = (_n==1) if touse == 1 // new margins if inrange(weight,`min', `max') & mark == 1 , /// changed over(weight) dydx(foreign) at(mpg=20 trunk=15) marginsplot , ciopts(recast(rarea) astyle(ci) ) yline(0) /// ylab(0(2500)10000, format(%9.0gc))

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Comment
Rob Shaw

Join Date: May 2015

Posts: 32
#6

27 Nov 2019, 04:28

Maarten: thanks for that code.

I tried it on my full data set but after 72 hours it still hadn't completed.

I was wondering if it would be possible to calculate it manually using the covariance matrix from the regression.

My data has 22 spline variables (representing age bands) and interacts these with 20 other variables (a mix of continuous and dummy factor variables). My regression command is

Code:

reg Yvar (i.sex c.IMD19score i.hholdnum 1.changehhtype 1.New_in_last_yr 1.deathinhholdlastyr 1.moved3times_2yrs 1.moved4plustimes_2yrs c.GP_patientexp1718 c.AEdrivetime)#c.( age0_1- age100_105), base vsquish noomit

where

Code:

age0_1- age100_105

are my linear splines.

I've manually estimated the marginal effect for a specific variable over age by multiplying the coefficients by the means of the variables for a single age to get the fitted value at that age, then adding the age splines multiplied by the coefficient and the change size. So for AEdrivetime, the effect of living an extra 10 minutes further than the average for a given age is

Code:

fittedvalue + sum(age0_1- age100_105)*coefficients[AEdrivetime#(age0_1- age100_105)]*10

But what I also want to get at is the confidence interval around that. I don't think it is

Code:

fittedvalue + sum(age0_1- age100_105)*coefficients[AEdrivetime#(age0_1- age100_105)]*10 +sum(age0_1- age100_105)*1.96*se[AEdrivetime#(age0_1- age100_105)]*10

(for the positive CI)

Any ideas much appreciated.

Rob
Comment

Announcement

How can I calculate a confidence interval around a linear spline that it interacted with another variable

Comment

Comment

Comment

Comment

Comment