Testing for (inverse) U-shaped association between binary dependent variable and independent variables

Antonia Borg

Join Date: Dec 2022

Posts: 3
#1

17 Dec 2022, 10:21

Hello all,

As a complete Stata beginner, I have run into a big problem in my data analysis.
I am analyzing the effect of sustainability framing on crowdfunding success. For some of my independent variables, such as "gif_images_count", I suspect a non-linear relationship with success. I would like to test for an inverse-U-shaped association, and I would also very much like to plot it. I am using Stata 14.0 and have tried several commands, shown below.

geeqm dependent_variable independent_variable, distribution(binomial) link(logit)
--> geeqm success gif_images_count, distribution(binomial) link(logit)

logistic success i.gif_images_count##c.gif_images_count

predict success_prob, pr
twoway scatter success_prob gif_images_count, mcolor(black) msymbol(circle) || function gif_images_count = a*gif_images_count^2 + b*gif_images_count + c, range(0 100)

The first command did not run at all; the package was somehow not available in my Stata version.
The second one "works" in the sense that no error message appears, but it produces an enormously long table. I would very much like to have it condensed, and to have a scatterplot that shows the relationship clearly.

The last command just reports that many of the options are not allowed.
I tried to follow the FAQ as closely as possible, and I am very sorry if this post goes against the guidelines somewhere. I was not able to find anything on this problem that I could understand.

Thank you very much for your help.
Kind regards,
Antonia
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

17 Dec 2022, 11:01

Change

Code:

logistic success i.gif_images_count##c.gif_images_count

to

Code:

logistic success c.gif_images_count##c.gif_images_count

because you want to treat gif_images_count as a continuous variable, not as a categorical variable which is what is implied by the i. prefix.

See

Code:

help factor variables

for documentation on the factor variable notation.
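
To make the difference concrete, here is a minimal sketch on simulated data. The variable names follow the thread, but the sample size and the coefficients used to generate the data are made up purely for illustration; they are not from the original poster's dataset.

```stata
* Simulated stand-in data (assumption: all numbers below are invented)
clear
set seed 12345
set obs 2000
gen gif_images_count = floor(21*runiform())      // integer counts 0..20
gen success = runiform() < invlogit(-1 + 0.40*gif_images_count - 0.02*gif_images_count^2)

* i. prefix: one indicator per distinct count, hence a very long table
logistic success i.gif_images_count

* c. prefix: one linear term and one squared term, three parameters in all
logistic success c.gif_images_count##c.gif_images_count
```

The second table has only three rows: the linear term, the squared term, and the constant.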
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#3

17 Dec 2022, 11:19

logistic success i.gif_images_count##c.gif_images_count

does not make sense as a model. For the purpose you describe, it would seem that you need to treat gif_images_count as a continuous variable. Hence,

Code:

logistic success c.gif_images_count##c.gif_images_count

will get you started in the right direction.

That said, be aware that an apparently good fit of a quadratic regression model does not guarantee that there is a U or inverse-U shaped relationship. Run this code to see what I mean:

Code:

clear*
set obs 100
gen x = _n
gen y = log(x)
regress y c.x##c.x
predict quadratic
label values quadratic quadratic_fit
graph twoway line quadratic y x

As you can see from the graph, and probably remember from high school mathematics, the graph of the logarithm function is by no means inverse-U shaped. Yet a quadratic fit over a finite range (it doesn't have to be 1 to 100; pretty much any finite range will do) yields a very high R² and highly statistically significant coefficients, even though the graph shows that the quadratic curve substantially misrepresents the shape of the actual relationship. The quadratic model here might still be useful for many purposes, even though it is clearly wrong. So you should think carefully about whether you are concerned specifically with an inverse U, or are more generally trying to capture non-linearity in the relationship. I think that when most people refer to an inverse U or a U, they really just want to capture non-linearity, or perhaps establish a "diminishing returns" relationship.

If you are really interested in an inverse-U shaped relationship, you have a much more difficult task ahead of you. You need to establish that the relationship is increasing at low values of gif_images_count and decreasing at high values, and that there is a unique maximum somewhere in between. That is actually a very challenging statistical task because you need a defensible way of defining "low values" and "high values."
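
As a starting point only, the sketch below computes two necessary (not sufficient) ingredients after a quadratic logit fit: the estimated turning point, -b1/(2*b2), and the slope of the log odds, b1 + 2*b2*x, at the smallest and largest observed values. The data are simulated with made-up coefficients, and using the observed minimum and maximum as "low" and "high" is an arbitrary choice made for illustration, not a recommended definition.

```stata
* Simulated stand-in data (assumption: all numbers below are invented)
clear
set seed 12345
set obs 2000
gen gif_images_count = floor(21*runiform())      // integer counts 0..20
gen success = runiform() < invlogit(-1 + 0.40*gif_images_count - 0.02*gif_images_count^2)

logit success c.gif_images_count##c.gif_images_count

* Turning point of the fitted quadratic on the log-odds scale: -b1/(2*b2)
nlcom (turning_point: -_b[gif_images_count] / (2*_b[c.gif_images_count#c.gif_images_count]))

* Slope of the log odds, b1 + 2*b2*x, at the observed extremes of x
summarize gif_images_count, meanonly
local lo = r(min)
local hi = r(max)
lincom _b[gif_images_count] + 2*`lo'*_b[c.gif_images_count#c.gif_images_count]
lincom _b[gif_images_count] + 2*`hi'*_b[c.gif_images_count#c.gif_images_count]
```

An inverse U would require a positive slope at the low end, a negative slope at the high end, and a turning point whose confidence interval lies well inside the observed range. Because the logistic function is monotone, the turning point in gif_images_count is the same on the probability scale as on the log-odds scale.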

I should add that all of this becomes even more complicated with logistic regression, because the logistic link function is itself non-linear. It adds curvature of its own, so a plot of predicted probability can show apparent non-linearity, or obscure real non-linearity, relative to what you would see in the predicted log odds. So this is a fraught area that requires careful attention to detail and clear thinking.
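
One way to see the two scales side by side is with -margins- and -marginsplot-. This is a sketch on simulated data (the coefficients and the grid of values 0, 2, ..., 20 are assumptions chosen for illustration):

```stata
* Simulated stand-in data (assumption: all numbers below are invented)
clear
set seed 12345
set obs 2000
gen gif_images_count = floor(21*runiform())      // integer counts 0..20
gen success = runiform() < invlogit(-1 + 0.40*gif_images_count - 0.02*gif_images_count^2)

logit success c.gif_images_count##c.gif_images_count

* Predicted log odds (the linear predictor): an exact parabola in x
margins, at(gif_images_count = (0(2)20)) predict(xb)
marginsplot, name(logodds, replace) title("Log-odds scale")

* Predicted probability: the same fit after passing through the logistic link
margins, at(gif_images_count = (0(2)20))
marginsplot, name(prob, replace) title("Probability scale")
```

Comparing the two graphs shows how much of the visible curvature comes from the model's quadratic term and how much from the link function.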

function gif_images_count = a*gif_images_count^2 + b*gif_images_count + c

This also doesn't make sense, nor does it fit with what -twoway function- does. If you want to see a quadratic fit to the relationship between success probability and gif_images_count, use -twoway qfit-. See -help twoway qfit- for details.
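
For illustration, here is a sketch of the two commands on simulated data. The coefficients in the -twoway function- formula are simply the made-up values used to generate the data; -twoway function- needs actual numbers and a formula written in terms of x, not placeholders like a, b, and c. The /// line continuation works in a do-file, not when typed interactively.

```stata
* Simulated stand-in data (assumption: all numbers below are invented)
clear
set seed 12345
set obs 2000
gen gif_images_count = floor(21*runiform())      // integer counts 0..20
gen success = runiform() < invlogit(-1 + 0.40*gif_images_count - 0.02*gif_images_count^2)

logit success c.gif_images_count##c.gif_images_count
predict success_prob, pr

* -twoway function- plots y = f(x) for a fully specified formula
twoway function y = invlogit(-1 + 0.40*x - 0.02*x^2), range(0 20) name(truth, replace)

* -twoway qfit- estimates a least-squares quadratic itself and draws it
twoway (scatter success_prob gif_images_count, mcolor(black) msymbol(circle)) ///
       (qfit success_prob gif_images_count), name(fit, replace)
```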

Added: Crossed with #2.