Fitting a polynomial model on longitudinal count data

Arman Aksoy

Join Date: May 2019

Posts: 13
#1


27 Jan 2020, 11:38

Hello,

I have an unbalanced longitudinal dataset with a count dependent variable (it is not zero-inflated but is overdispersed).
I'd like to fit three (3) continuous predictors that are reported to have curvilinear effects on the outcome.
The literature also reports that the three predictors interact with each other.
The model will also include some other control variables.

I am trying to demonstrate the effect of the three-way interaction between the predictors and to describe the eight (8) cases of high/low values (p1:low, p2:low, p3:low; p1:high, p2:low, p3:low; ...).

If I try to code this myself I calculated that I should end up with over a million different regressions since I'd have to test every interaction term and their combinations.

The closest thing I could find for fitting such a model in Stata is the "bfit" command, but it doesn't support panel data or negative binomial models.

Could someone point me toward a way to fit the best model?

Best Regards,

Arman
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

27 Jan 2020, 12:00

"If I try to code this myself I calculated that I should end up with over a million different regressions since I'd have to test every interaction term and their combinations."

I think your calculations have gone wrong. You have three variables, each in a low/high category. That makes for 8 combinations. Now, even if you had to do a regression for every possible subset of those 8, that would only be 2⁸ = 256 regressions. Moreover, you don't have to do all that, because in general, interaction models require that any interaction term that is included must be accompanied by all sub-interactions. So, for example, including p1#p2#p3 but omitting p1#p2 is not appropriate. So, putting aside issues of polynomials, the number of different interaction models to test is really just {p1##p2##p3; p1##p2 p1##p3 p2##p3; p1##p2 p1##p3; p2##p3 p1##p3; p1##p2 p2##p3; p1##p2; p1##p3; p2##p3; no interactions}. If you want to take this to a worst-case scenario, the "no interactions" model encompasses 8 more possibilities for models involving all possible subsets of {p1, p2, p3} without interaction terms. So that's nowhere near a million. Either your calculations are way off or you have something entirely different in mind.
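A minimal sketch of the nine interaction structures listed above, stored as Stata local macros and printed one per line. The names p1, p2 and p3 come from the post; the macro names m1 to m9 are made up for illustration, and nothing is fitted here.

```stata
* Illustrative only: the nine hierarchical interaction structures,
* copied from the list above. m1-m9 are hypothetical macro names.
local m1 "p1##p2##p3"
local m2 "p1##p2 p1##p3 p2##p3"
local m3 "p1##p2 p1##p3"
local m4 "p2##p3 p1##p3"
local m5 "p1##p2 p2##p3"
local m6 "p1##p2"
local m7 "p1##p3"
local m8 "p2##p3"
local m9 "p1 p2 p3"

* Print each specification; a regression command could replace -display-
forvalues k = 1/9 {
    display "Model `k': `m`k''"
}
```

Because `##` expands to the interaction plus all of its lower-order terms, none of these strings can include a three-way term without the two-way terms beneath it.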

That said, I am not at all a fan of fitting all possible models to a set of data: this is basically noise mining and the results are usually not reproducible. There is also the issue of how one might select the "best" model from among all of those after seeing the results, and every way that I am aware of has serious problems. In your post you refer to a literature, which seems to suggest that the p1##p2##p3 model is appropriate--so why are you even looking at the others? Do you have a good critique of that literature to suggest it is wrong?

I don't quite get how you plan to model polynomials if these variables take on only two values, low and high. None of that makes sense since any power of a 0/1 variable is just equal to the original 0/1 variable.
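The point about powers of a 0/1 variable can be checked directly. A small sketch, using a made-up indicator named high on four artificial observations:

```stata
* Illustrative only: "high" is a hypothetical 0/1 indicator.
clear
set obs 4
generate byte high = mod(_n, 2)   // alternates 1, 0, 1, 0
generate high_sq = high^2

* The square is identical to the original, so it adds no information
assert high_sq == high
```

If the two variables differed anywhere, `assert` would stop with an error; here it passes silently.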
Arman Aksoy

Join Date: May 2019

Posts: 13
#3

27 Jan 2020, 17:38

Hello M. Schechter,

Thank you for your answer!

I'll try to address your concerns in the order they appear in your answer:

1) I shouldn't have written that low/high idea; it was a last-minute addition, and I should have figured it would create confusion.
The variables are continuous between 0 and 1.

2) I was not sure about the interaction terms, so what I understand from your comment is that if my predictors are "x", "y", and "z", these are the right models that I can test:
x y
x y x*y
x z
x z z*x
y z
y z y*z
x y z x*y
x y z x*z
x y z y*z
x y z x*y x*z
x y z x*y y*z
x y z x*z y*z
x y z x*y x*z y*z
x y z x*y x*z y*z x*y*z
As you can see, I also test models that drop one of the predictors. However, I cannot test the following, as they're not appropriate models:
x y z x*y x*y*z (missing x*z y*z)
x y z x*z x*y*z (missing x*y y*z)
x y z y*z x*y*z (missing x*z x*y)
x y z x*y x*z x*y*z (missing y*z)
x y z x*y y*z x*y*z (missing x*z)
x y z x*z y*z x*y*z (missing x*y)
Is that correct? If so, then yes, my calculations are way off!

3) The reason I was ending up with over a million is that my predictors MIGHT be curvilinear, so the complete model is as follows:
x x2 y y2 z z2 x*y x2*y x*y2 x2*y2 x*z x2*z x*z2 x2*z2..... x2*y2*z2
Of course I would remove the squared terms and the interactions as needed (if I remove x2 then I remove all interactions involving it too), but the question raised in 2) still applies, and that was the reason behind the large count.

4) I'm planning to use AIC and BIC values to measure best fit. I know that both have their drawbacks and can supplement with other methods if necessary.

5) The literature on the subject is divided. While some studies indicate that the 3 predictors have curvilinear effects on similar dependent variables, others indicate linear effects. There are some studies of two-way interactions, but the three-way interaction is a theoretical model that is not backed by empirical studies (which is where I try to come in).

6) The variables are continuous between 0 and 1. I should definitely not have included that case study type matrix with low/high, a confusing last moment pitch...

Hope this made the case a bit clearer,

Thank you,

Arman

Last edited by Arman Aksoy; 27 Jan 2020, 18:03.
Arman Aksoy

Join Date: May 2019

Posts: 13
#4

27 Jan 2020, 19:57

P.S.: The combination of dependent/independent variables I use has never been studied in the literature.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

28 Jan 2020, 10:47

OK, that's clearer. Your understanding about which models are admissible and which are not is correct. I would add to that, now that the role of polynomials has been clarified, that a similar rule applies: if you include a quadratic term, the linear term must also be included. If you include a cubic term, the linear and quadratic must also be included. The inclusion of the n'th power requires the inclusion of all lower powers.

I do not know of any existing programs that will generate and run all of these models for you. You will have to "roll your own" for that, I think. Your work will be simplified by the use of factor variable notation. So, x##y##z automatically includes x#y#z, x#y, x#z, y#z, x, y, and z. So you only have to specify the highest level interactions. Similarly, x² can be represented as x##x and you will automatically get both x#x and x. (With continuous predictors such as yours, each variable needs the c. prefix in an actual estimation command, e.g. c.x##c.x, because Stata otherwise treats variables in an interaction as categorical.) So it will be complicated, but a few loops should be able to do it for you. Something like this will generate all possible interactions up to three way, of all possible linear and quadratic terms in x, y, and z. If you need to go to cubic, basically just replace the 2's by 3's in the first three -forvalues- commands. And ultimately, instead of -display-ing the model you will want to put a regression command there, followed by whatever you will do with the results.

Code:

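clear*
cls

* Loop over the highest power of each variable (0 = omitted, 1 = linear, 2 = quadratic)
forvalues x_exp = 0/2 {
    forvalues y_exp = 0/2 {
        forvalues z_exp = 0/2 {
            local model
            foreach u in x y z {
                * Build u##u##... with as many copies of u as its exponent
                local `u'_term
                forvalues i = 1/``u'_exp' {
                    local `u'_term ``u'_term'##`u'
                    * Drop the leading "##" left by the first pass
                    if substr("``u'_term'", 1, 2) == "##" {
                        local `u'_term = substr("``u'_term'", 3, .)
                    }
                }
                * Append this variable's term to the model, if it has one
                if "``u'_term'" != "" {
                    local model `model'##``u'_term'
                }
                if substr(`"`model'"', 1, 2) == "##" {
                    local model = substr("`model'", 3, .)
                }
            }
            * Show the three exponents, then the model specification
            display _newline `x_exp', `y_exp', `z_exp'
            display `"`model'"'
        }
    }
}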
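One way the -display- step might be swapped for an estimation command is sketched below. It is only a sketch under several assumptions not stated in the thread: the outcome is called count_y, the panel identifier is id, a random-effects negative binomial model (-xtnbreg, re-) is the intended estimator, and AIC/BIC from -estat ic- are the comparison criteria. The predictors carry the c. prefix because they are continuous. Control variables would need to be added to the command.

```stata
* Sketch only. count_y and id are hypothetical names; x, y, z are continuous.
xtset id

tempname results
tempfile ic
postfile `results' str80 model double(aic bic) using `ic'

forvalues x_exp = 0/2 {
    forvalues y_exp = 0/2 {
        forvalues z_exp = 0/2 {
            * Build e.g. ##c.x##c.x##c.y, one copy of c.u per power of u
            local model
            foreach u in x y z {
                forvalues i = 1/``u'_exp' {
                    local model `model'##c.`u'
                }
            }
            * Skip the empty (0, 0, 0) case, then strip the leading "##"
            if "`model'" == "" continue
            local model = substr("`model'", 3, .)

            * Fit the model; move on if it fails or does not converge
            capture noisily xtnbreg count_y `model', re
            if _rc continue

            * r(S) holds N, ll(null), ll(model), df, AIC, BIC in that order
            estat ic
            matrix S = r(S)
            post `results' ("`model'") (S[1,5]) (S[1,6])
        }
    }
}
postclose `results'

* List the fitted specifications, smallest BIC first
use `ic', clear
sort bic
list, clean
```

Collecting the criteria with -postfile- keeps every fitted specification in one dataset, so the ranking can be inspected afterwards instead of being read off the log. The caveats raised earlier in the thread about choosing a "best" model after seeing the results still apply.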
Arman Aksoy

Join Date: May 2019

Posts: 13
#6

28 Jan 2020, 12:33

Oh nice! Thank you for the code! I'll adapt it as needed if I decide to use higher powers.
