  • splines vs tertiles in regression

    I'm finessing some analysis looking at the effect of an environmental biomarker (EB) on anthropometric outcomes in children (AO). Let me preface by saying I am an expert SPSS user who is often asked elementary syntax questions that I readily help people with. So I understand if you think this is elementary, but I really have searched online and in the forums and can't find a response appropriate to my query. Reader - kindly read on and please help if you can. Thank you.

    Originally I ran the analysis in my comfort zone of linear regression. But my colleagues and I started thinking that perhaps the linear relationship is not the best way to analyze.
    So I used tertiles and ran linear regression using dummy variables. Because of the size of my sample, tertile cuts are as small as I should go to maintain over n=100 in each tertile.
    But we started to see something in the third tertile that we thought might be driving the relationships we're seeing. So I was advised to use spline analysis in STATA since SPSS cannot handle this properly. The way I understood it, I was told to use the spline to tell me the best cutpoints (knots) to use instead of the tertiles.

    My interpretation of this (perhaps grossly naive and incorrect) was that I should use the knots in the same way I used the tertile cuts - make dummy variables and run linear regression. But the only way I could get STATA to make the knots without any input from me was to use cubic splines. The problem is that cubic splines cannot have fewer than 3 knots, which brings my N to under 100 in the segment before the first knot and in the segment after the last knot (the N hovers around 40 for those 2 segments).

    Now I'm thinking what I was supposed to do was actually run a spline regression syntax which incorporates tertiles in STATA. Is there a consensus on what is the best of the 3 spline syntax modes for this? I've gotten as far as
    mkspline m1sptert 3 = EB, pctile displayknots
    But this gives me 375 values in each tertile. I actually am not clear on what the new variables are even supposed to represent.

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
       m1sptert1 |        375    .1679392    .3416748   -1.76026   .3310139
       m1sptert2 |        375     .338585    .2955811          0   .6702747
       m1sptert3 |        375    .2013302     .440995          0   2.925424

    I would so greatly appreciate being pointed in the right direction and especially if sample code could be offered. I'm already behind on my deadline and mostly it's because I'm very muddy on the spline analysis and I'm such a novice at STATA.

  • #2
    I am a fairly experienced Stata user who hasn't used SPSS for about 30 years, so we may balance.

    mkspline creates a bundle of new predictors, all of which are defined for all observations you used. To get a feeling for what you just did, plot what you created, e.g.

    Code:
    scatter m1sptert? EB
    When you say spline versus tertiles, that misdescribes the question which is (regression using indicator variables using tertiles) versus (regression using spline predictor variables using tertiles) versus (any other suitable way of handling nonlinearity).
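
    A minimal sketch of the first two options, assuming the response is stored in a variable called AO (a stand-in taken from the abbreviation in #1; only EB appears in the posted command), that EBtert is a free name, and that m1sptert1-m1sptert3 have not already been created:

    Code:
    * (1) indicator variables for tertiles of EB
    xtile EBtert = EB, nquantiles(3)
    regress AO i.EBtert
    * (2) linear spline predictors with knots at the tertiles of EB
    mkspline m1sptert 3 = EB, pctile displayknots
    regress AO m1sptert1 m1sptert2 m1sptert3
    In (1) each coefficient is a shift in mean AO relative to the lowest tertile. In (2) each coefficient is the slope of AO on EB within one segment, and the fitted line is continuous at the knots.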

    There can't be a consensus on what's best, because it depends on the problem. Choice of knots can be central to a problem whenever there is an expectation of some threshold at which behaviour changes. This could be, in some social problems, an age at which something happens or is allowed, e.g. leaving school or driving or drinking legally, which might be a crucial level separating different regimes of behaviour. This kind of problem is often addressed with linear splines at first. Conversely, it should often be true that choice of knots is not especially crucial whenever cubic splines are used because the whole point, or much of it, is that the created function and its rate of change are changing smoothly around a knot.
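
    For a known threshold, a linear spline with the knot placed by hand might look like the following sketch, in which y, age and the knot at 18 are purely hypothetical and not taken from #1:

    Code:
    * one knot at age 18: age_lo carries the slope below 18, age_hi the slope above
    mkspline age_lo 18 age_hi = age
    regress y age_lo age_hi
    * test whether the slope changes at the knot
    test age_lo = age_hi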

    Given what you say, I'd leap for cubic splines, as I would not guess that there are critical levels of contaminant that have threshold effects, or that the relationship consists of linear segments, but rather that more means worse (or better) in some smooth but not necessarily simple manner. A while back I created rcspline as a sandbox for exploring relationships between a single response and a single predictor. For example,

    Code:
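    * install the community-contributed command from SSC (needed once)
    ssc install rcspline
    * load the auto dataset shipped with Stata
    sysuse auto, clear
    * restricted cubic spline fit with the default knots, then with 3 knots
    rcspline mpg weight
    rcspline mpg weight, nknots(3)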
    This example shows that having only 74 values (a much smaller sample than you have) is compatible with cubic splines. Your wording suggests that you are thinking that using spline predictors somehow means local fitting, but all of the data contributes to a fit; it's just that there are constraints on what is fitted. That said, my wild guess is that your data are much noisier than this single example. (Also, you may have other predictors that you don't mention here.)

    I would not use tertile knots with cubic splines; it's my experience that the default knots work well if anything does; if not, splines are the wrong choice.

    N.B. rcspline is just an instructional tool. For serious work, running the regression directly is essential.
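
    As a sketch of what running the regression directly could look like, assuming again that the response is in a variable named AO (a stand-in; only EB appears in the command in #1) and using the hypothetical names EBrcs and AOhat:

    Code:
    * restricted cubic spline basis with 3 knots at mkspline's default percentiles
    mkspline EBrcs = EB, cubic nknots(3) displayknots
    * 3 knots give 2 predictors: EBrcs1 is EB itself, EBrcs2 carries the curvature
    regress AO EBrcs1 EBrcs2
    * a test of the nonlinear term is a test of departure from a straight line
    test EBrcs2
    * fitted curve over the data
    predict AOhat
    twoway scatter AO EB || line AOhat EB, sort
    Any other predictors would be added to the regress line, in which case the plot of AOhat against EB is no longer a single smooth curve.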

    N.B. Stata, not STATA please. (FAQ Advice Section 18).


    • #3
      Thank you so much Nick. You have set my mind more at ease. I feel like I can move forward with what I'm doing. I truly appreciate your taking the time to write such a coherent explanation. And I appreciate the edification regarding "Stata" - I never knew (and I did take a peek at the FAQs).