  • Suggestions on linear splines

    Dear all,
    I would like to have your feedback on some general (rather than Stata-related) questions with regard to linear splines.

    1. My current approach is to explore the shape of the relationship between a continuous variable x and an outcome y by first regressing y on a restricted cubic spline (RCS) of x with a predefined number and placement of knots; if the non-linearity test (e.g. an overall test on the non-linear components of the spline) is significant, I proceed to a linear spline approach, otherwise I stick with a linear regression of y on x. Would you agree?
    2. Regarding the subsequent linear spline analysis, I see that some authors select one knot based on visual inspection of the RCS fit; I find this unfeasible when you have multiple variables and I am also not sure this is methodologically correct, so I would stick with two prespecified knots (e.g. 30th and 60th percentile). Suppose one is interested in testing the interaction between x and z on y, with z as a binary variable, and we have already established through RCS that the relation between x and y is non-linear. Would you trim x to values that are common to the levels of z, or analyze as is? Also, would you keep the knots for the linear spline as found in the overall sample, or should those be specific to the levels of z?

    Thanks,
    Manuel

  • #2
    1) I don't really see a benefit to using restricted cubic splines first to test for non-linearity. Restricted cubic splines and linear splines serve slightly different purposes: They both allow for non-linearity, but restricted cubic splines' strength is more in creating visually pleasing smooth curves, while linear splines' strength is more in the easy interpretation of its parameters. You can easily test for non-linearity in a linear spline: in the default parameterization by mkspline you would use test to test whether the coefficients of all the spline terms are equal. I don't see the added value of doing an additional test using a different model.
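
A minimal sketch of the idea in point 1, written in plain Python rather than Stata so that it is self-contained. It assumes the default mkspline parameterization, in which each constructed variable covers one segment of x, so each coefficient is the slope within that segment and a linear relationship shows up as equal coefficients. The function names, the knots at 30 and 60, and the noiseless toy data are illustrative assumptions, not taken from the thread, and the sketch shows only the interpretation of the coefficients, not the Wald test that test would perform.

```python
def linear_spline_basis(x, knots):
    """Segment-wise linear spline terms for one value x (one term per segment)."""
    knots = sorted(knots)
    terms = [min(x, knots[0])]                      # segment below the first knot
    for lo, hi in zip(knots, knots[1:]):
        terms.append(max(min(x, hi), lo) - lo)      # segments between knots
    terms.append(max(x, knots[-1]) - knots[-1])     # segment above the last knot
    return terms


def solve(a, b):
    """Solve a linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Toy data (assumed): x from 0 to 100, knots at 30 and 60.
xs = list(range(0, 101))
basis = [linear_spline_basis(x, [30, 60]) for x in xs]

# A kinked relationship: slopes 0.5, 0.1 and -0.3 in the three segments.
y_kinked = [2 + 0.5 * b[0] + 0.1 * b[1] - 0.3 * b[2] for b in basis]
# A truly linear relationship: the same slope everywhere.
y_linear = [2 + 0.5 * x for x in xs]

coef_kinked = ols(basis, y_kinked)   # intercept, then one slope per segment
coef_linear = ols(basis, y_linear)   # the three slopes come out equal
```

In real data the fitted slopes will differ by chance, which is why the equality of the coefficients has to be judged with a proper test (or by eye) rather than by exact comparison.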

    2) Knot placement is a really big deal when you use a linear spline. Whether or not your model can capture the non-linearity in the data depends critically on where you place the knots. You want the knots to be where a substantial change in the slope happens, otherwise a linear spline cannot capture it. So just setting the knots at the 30th and 60th percentile is a very bad idea. You can use a restricted cubic spline for exploring where the knots might be, because it is less sensitive to knot choice. It is fairly common, though far from always the case, that I have a theoretical reason for the knots, e.g. for age the knots could be the age at which a person becomes an adult (18) and the retirement age, or for hours worked per week you could choose 40 to separate part-time from over-time working, etc. When that is not the case I often just use the maximum or minimum from a quadratic model as a first guess on where the knot could be. But I would still closely examine it and very often change that value, maybe on theoretical grounds or after closer inspection of the data.
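
The "maximum or minimum from a quadratic model" mentioned in point 2 can be sketched as follows. For a fitted curve b0 + b1*x + b2*x^2 the turning point lies at x = -b1 / (2*b2). The function name and the coefficient values are hypothetical, chosen only for illustration.

```python
def quadratic_turning_point(b1, b2):
    """x at which b0 + b1*x + b2*x**2 reaches its maximum or minimum."""
    if b2 == 0:
        raise ValueError("no turning point: the fitted curve is a straight line")
    return -b1 / (2 * b2)


# Hypothetical coefficients from a regression of y on x and x squared.
first_guess = quadratic_turning_point(b1=1.2, b2=-0.02)   # close to x = 30
```

Such a value is only a starting point: it may fall outside the observed range of x, and it still has to be checked against theory and the data as described above.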

    The point is, I would use linear splines if I was interested in interpreting the coefficients. That is its strength: you get a non-linear effect but with coefficients that are almost as easy as just linear effects. That knot does play into the interpretation, so it needs to make sense. So if I use a linear spline, I want to be in control of where the knots are and not some rule of thumb. If the variable you want a non-linear effect for is not the variable of primary interest, i.e. it is "just" a control variable, then a linear spline would not be my first choice, because it is so sensitive to knot placement. This is not absolute of course. For example, if I were already using linear splines for another variable, then I typically would not use two different methods of including non-linearity in the same model, and would thus also use linear splines for my control variables when necessary.

    With interactions that just depends on the situation, for example are the sub-groups big enough to warrant separate inspection. You typically need a lot of data to find and accurately describe non-linearity. If you want to find the knots in each sub-group separately, then you first need to make sure you have enough data in each sub-group to support such an analysis. That is the typical limitation that I run into, but your situation may differ.
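
One rough way to check whether each sub-group has enough data is to count the observations in every combination of group and spline segment. The sketch below does this in plain Python; the function name, the knots and the toy values are assumptions for illustration, and what counts as "enough" is left to judgment.

```python
from bisect import bisect_right
from collections import Counter


def segment_counts(x, z, knots):
    """Count observations in each (level of z, spline segment) cell.

    Segment 0 lies below the first knot, segment 1 between the first and
    second knot, and so on. A value equal to a knot is counted in the
    segment above it.
    """
    knots = sorted(knots)
    return Counter((zi, bisect_right(knots, xi)) for xi, zi in zip(x, z))


# Toy data (assumed): eight observations with z = 0, two with z = 1.
x = [5, 12, 25, 31, 44, 58, 61, 70, 15, 35]
z = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
counts = segment_counts(x, z, knots=[30, 60])
```

Here the z = 1 group has no observations above the second knot, so a slope for that segment could not be estimated separately for that group.
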
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    • #3
      @Maarten Buis, thank you so much for those insights and I totally agree with you with regard to the advantages of each approach. In fact I'm planning to present RCS plots but resort to linear splines to be able to provide coefficients. Regarding some of the points you raise:

      You can easily test for non-linearity in a linear spline: in the default parameterization by mkspline you would use test to test whether the coefficients of all the spline terms are equal. I don't see the added value of doing an additional test using a different model
      I like this approach, so let's assume we have 2 knots and thus 3 linear spline segments: would you first test the overall association between x and y, then proceed to test non-linearity (provided the overall test is significant), and finally compare the individual slopes with each other?
      Also, any thoughts on extreme values of x? Many authors trim the extreme percentiles when analyzing restricted cubic splines, would this be an issue also with linear splines?

      You can use a restricted cubic spline for exploring where the knots might be, because it is less sensitive to knot choice
      So if the model is a logistic regression, would you inspect the plot of predicted probabilities (e.g. with adjustrcspline)? I am quite concerned with choosing the knots based on visual inspection. I read that the nl command can be used to estimate the optimal inflection point (https://stats.oarc.ucla.edu/stata/fa...se-regression/), would this be feasible for a more "objective" approach? And if the estimated inflection point is associated with a non-significant statistical test, could this be taken as proof that the association is linear?
      Last edited by Manuel Ferraro; 17 Sep 2022, 04:02.

      • #4
        Originally posted by Manuel Ferraro
        In fact I'm planning to present RCS plots but resort to linear splines to be able to provide coefficients.
        That is likely to cause more confusion than it helps. For a single paper, I would pick just one of these and stick to just that method, and live with its limitations.

        Originally posted by Manuel Ferraro
        let's assume we have 2 knots and thus 3 linear splines, would you first test the overall association between x and y and then proceed to test non-linearity (provided the overall test is significant) and finally comparing the individual slopes between each other?
        Testing is used way too much. Statistical tests have their place, but that place is much much much more limited than it is currently used for. I am very skeptical about its use in model choice. You just tend to stack test upon test upon test, and who knows what your p-values mean after all that... Now if the existence of non-linearity is the primary hypothesis of interest, then testing makes more sense. Otherwise I would just avoid it altogether, and just rely on the good old interocular trauma test(*).

        Originally posted by Manuel Ferraro
        Also, any thoughts on extreme values of x? Many authors trim the extreme percentiles when analyzing restricted cubic splines, would this be an issue also with linear splines?
        Any automated outlier deletion is the work of the devil, and anyone proposing its use should be burned at the stake before being (politely of course) asked to leave the profession.


        Originally posted by Manuel Ferraro
        So if the model is a logistic regression, would you inspect the plot of predicted probabilities (e.g. with adjustrcspline)? I am quite concerned with choosing the knots based on visual inspection. I read that the nl command can be used to estimate the optimal inflection point, would this be feasible for a more "objective" approach?
        With those nl approaches, you typically have to feed it good starting positions, and guess where those come from... Also that does not work well with logit.

        Originally posted by Manuel Ferraro
        And if the estimated inflection point is associated with a non-significant statistical test, this could be taken as proof that the association is linear?
        No, absolutely not. Always keep in mind what the null hypothesis of a test is. In this case the null hypothesis is that the inflection point happens at x=0, not that there is no inflection point. Also this is a very tough thing to estimate, so the power of your test is probably extremely low. So a non-significant test just tells you that you don't know whether the null hypothesis is false or not; it certainly does not tell you that the null hypothesis is true.

        (*) inter = between, interocular = between the eyes, interocular trauma = it hits you between the eyes, so the interocular trauma test means that you just look and see if the pattern is so obvious that it "hits you between the eyes".
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        • #5
          With regard to the sequential testing, I seem to remember that it is proposed by Harrell in his Regression modeling strategies, that you should first reject the overall hypothesis that the curve is flat before exploring non-linearity. Although I like your proposed interocular trauma approach, I noticed that for my project (I have several independent variables), if I select an inflection point based on visual inspection of RCS, the p-value for the difference between the two slopes ends up being almost always significant, even for those variables for which the non-linearity tests are not significant. I guess this is expected since I'm just picking the optimal point in the dataset, so I think a more robust, a priori set of rules would be more beneficial.
          For instance, look at the plot below:

          [Attached image: Senza titolo.jpg]

          The non-linearity p-value based on both RCS and linear splines at 30/60 percentile would be highly non-significant, however if I place the inflection point at about the second major tick, the p-value for the marginal effect of the second slope would become significant.

          And yes, this project is really focusing on linearity vs non-linearity and potential inflection points for non-linear associations.

          Thanks,
          Manuel
          Last edited by Manuel Ferraro; 17 Sep 2022, 09:10.

          • #6
            you write, "I seem to remember that it is proposed by Harrell in his Regression modeling strategies, that you should first reject the overall hypothesis that the curve is flat before exploring non-linearity." - but I do not think this is true and the use of pre-test estimators is counter to Harrell's general philosophy; I think a better summary of what he believes is (p.64 of the second edition of his book on "Regression modeling strategies"), "In the vast majority of studies, however, there is every reason to suppose that all relationships involving nonbinary predictors are nonlinear. In these cases, the only reason to represent predictors linearly in the model is that there is insufficient information in the sample to allow us to reliably fit nonlinear relationships."

            • #7
              Originally posted by Rich Goldstein
              you write, "I seem to remember that it is proposed by Harrell in his Regression modeling strategies, that you should first reject the overall hypothesis that the curve is flat before exploring non-linearity." - but I do not think this is true and the use of pre-test estimators is counter to Harrell's general philosophy; I think a better summary of what he believes is (p.64 of the second edition of his book on "Regression modeling strategies"), "In the vast majority of studies, however, there is every reason to suppose that all relationships involving nonbinary predictors are nonlinear. In these cases, the only reason to represent predictors linearly in the model is that there is insufficient information in the sample to allow us to reliably fit nonlinear relationships."
              I double checked and I remembered correctly. Page 32:

              for a continuous predictor for which linearity is not assumed, all terms involving the predictor should be tested simultaneously to check whether the factor is associated with the outcome. This test should precede the test for linearity and should usually precede the attempt to eliminate nonlinear terms. [...] If this [...] test is insignificant, it is dangerous to interpret the shape of the fitted spline function because the hypothesis that the overall function is flat has not been rejected.
              In my specific case, I think it would be beneficial to have some kind of preliminary test to authorize me to look for inflection points, rather than leaving it to my eyes.
              Last edited by Manuel Ferraro; 17 Sep 2022, 09:47.

              • #8
                My beliefs about the meaning of the various quotes from Frank Harrell are not particularly relevant, so I wrote to Frank, and here is his response: "The reason I emphasized the overall chunk test is so that people wouldn't test individual terms which are quite arbitrary."

                • #9
                  I would agree with that; that's why I would proceed with an overall test before checking chunks (e.g. non-linear terms).
