  • Testing for (inverse) Ushaped association between binary dependent variable and independent variables

    Hello all,

    as a complete Stata beginner, I have run into a big problem in my data analysis.
    I am analyzing the effect of sustainability framing on crowdfunding success. For some of my independent variables, such as "gif_images_count", I suspect a non-linear relationship, and I would like to test for an inverse-U-shaped association. I would also very much like to plot the association. I am using Stata 14.0 and have tried several commands, shown below.


    geeqm dependent_variable independent_variable, distribution(binomial) link(logit)
    --> geeqm success gif_images_count, distribution(binomial) link(logit)

    logistic success i.gif_images_count##c.gif_images_count

    predict success_prob, pr
    twoway scatter success_prob gif_images_count, mcolor(black) msymbol(circle) || function gif_images_count = a*gif_images_count^2 + b*gif_images_count + c, range(0 100)


    The first command did not run, because the package was somehow not available within my Stata version.
    The second one "works" in the sense that no error message appears, but it produces an enormously long table. I would very much like to have the output condensed, and to have a scatterplot that shows the relationship clearly.

    The last command just reports that many of the options are not available.
    I tried to follow the FAQ as closely as possible, and I am very sorry if this post goes against the guidelines. I was not able to find anything I could understand that helps with my problem.

    Thank you very much for your help.
    Kind regards,
    Antonia

  • #2
    Change
    Code:
    logistic success i.gif_images_count##c.gif_images_count
    to
    Code:
    logistic success c.gif_images_count##c.gif_images_count
    because you want to treat gif_images_count as a continuous variable, not as a categorical variable which is what is implied by the i. prefix.

    See
    Code:
    help factor variables
    for documentation on the factor variable notation.



    • #3
      logistic success i.gif_images_count##c.gif_images_count
      does not make sense as a model. For the purpose you describe, it would seem that you need to treat gif_images_count as a continuous variable. Hence,

      Code:
      logistic success c.gif_images_count##c.gif_images_count
      will get you started in the right direction.
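
      To then draw the fitted association, one possibility is -margins- followed by -marginsplot-, both of which are available in Stata 14. This is only a sketch: the grid 0(5)50 is an assumed range, so check the actual range of gif_images_count first and adjust it.

      Code:
      logistic success c.gif_images_count##c.gif_images_count
      * check the observed range before choosing the grid below
      summarize gif_images_count
      * predicted probability of success at each grid value (0(5)50 is an assumption)
      margins, at(gif_images_count=(0(5)50))
      * plot those predictions with their confidence intervals
      marginsplot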

      That said, be aware that an apparent good fit of a quadratic regression model does not guarantee that there is a U or inverse-U shaped relationship. Run this code to see what I mean:
      Code:
      clear*
      set obs 100
      gen x = _n
      gen y = log(x)
      
      regress y c.x##c.x
      predict quadratic
      label variable quadratic "quadratic fit"
      
      graph twoway line quadratic y x
      As you can see from the graph, and probably remember from high school mathematics, the graph of the logarithm function is by no means inverse-U-shaped. Yet a quadratic fit over a finite range (it doesn't have to be 1 to 100; pretty much any finite range will do) yields a very high R-squared and highly statistically significant coefficients, even though the graph shows that the quadratic curve substantially misrepresents the shape of the actual relationship. The quadratic model here might nevertheless still be useful for many purposes, even though it is clearly wrong. So you should think carefully about whether you are concerned specifically with an inverse-U, or are more generally trying to capture non-linearity in the relationship. I think that when most people refer to an inverse-U or a U, they really just want to capture non-linearity, or perhaps establish a "diminishing returns" relationship.

      If you are really interested in an inverse-U shaped relationship, you have a much more difficult task ahead of you. You need to establish that the relationship is increasing at low values of gif_images_count and decreasing at high values, and that there is a unique maximum somewhere in between. That is actually a very challenging statistical task because you need a defensible way of defining "low values" and "high values."
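
      As a starting point only, note that the slope of the log-odds in a quadratic model is b1 + 2*b2*x, so you can at least check its sign at chosen points and estimate the turning point. In the sketch below, 0 and 20 are purely hypothetical "low" and "high" values; choosing them defensibly is exactly the hard part. It uses -logit- rather than -logistic- so that the results are reported on the coefficient (log-odds) scale.

      Code:
      logit success c.gif_images_count##c.gif_images_count
      * slope of the log-odds at a hypothetical low value (x = 0): b1
      lincom gif_images_count
      * slope at a hypothetical high value (x = 20): b1 + 2*20*b2
      lincom gif_images_count + 40*c.gif_images_count#c.gif_images_count
      * estimated turning point -b1/(2*b2), with a delta-method standard error
      nlcom -_b[gif_images_count] / (2*_b[c.gif_images_count#c.gif_images_count])
      An inverse-U would need a positive slope at the low value, a negative slope at the high value, and a turning point well inside the observed range of the data.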

      I should add that all of this becomes even more complicated with logistic regression, because the logistic link function is itself non-linear and adds curvature of its own. On the predicted-probability scale, that can create an appearance of non-linearity, or obscure a non-linearity that would be visible if you looked at predicted log-odds instead. So this is a very fraught area that requires careful attention to detail and great clarity of thought.
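
      One way to see the difference, assuming the quadratic model above and an assumed grid of 0(5)50 for gif_images_count, is to plot the predictions on both scales and compare the two graphs:

      Code:
      logistic success c.gif_images_count##c.gif_images_count
      * predicted log-odds (the linear predictor) over the assumed grid
      margins, at(gif_images_count=(0(5)50)) predict(xb)
      marginsplot, name(logodds, replace)
      * predicted probabilities over the same grid
      margins, at(gif_images_count=(0(5)50))
      marginsplot, name(prob, replace)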

      function gif_images_count = a*gif_images_count^2 + b*gif_images_count + c
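      This also doesn't make sense, nor does it fit with what -twoway function- does: that command plots y = f(x) for an expression written in terms of x with known numeric constants, and a, b, and c are not defined anywhere. If you want to see a quadratic curve fitted to the relationship between success probability and gif_images_count, use -twoway qfit-. See -help twoway qfit- for details.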
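
      For example, a minimal sketch that reuses the success_prob variable name from #1 (drop that variable first if it already exists):

      Code:
      logistic success c.gif_images_count##c.gif_images_count
      predict success_prob, pr
      * scatter of the predicted probabilities with a quadratic fit overlaid
      twoway (scatter success_prob gif_images_count) (qfit success_prob gif_images_count)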

      Added: Crossed with #2.
