Clustering in logistic regression

Ciara Lusin

Join Date: Jan 2017

Posts: 30
#1

24 Aug 2017, 10:54

Hi
I have individual-level data with around two million observations. My dependent variable is a dummy variable equal to 1 if the individual is a migrant and 0 otherwise. I want to estimate the impact of famine on migration, controlling for several variables, including categorical variables for provinces (there are 3 provinces in the regression). The measure of famine is the change in the growth rate of potato production, and the numbers are negative. When I cluster the standard errors at the county level, my results become insignificant. There are 32 counties, so 32 clusters. When I cluster at a lower level, such as district or individual level, the results are significant. However, I also need to account for within-county similarities, so I need to do the county-level clustering as well. What do you suggest I do? Is there any other specification I can use? I tried probit, but the result is still the same, and logit actually seems to work better for my data.

Thanks a lot.
Karla
Clyde Schechter

Join Date: Apr 2014

Posts: 29981
#2

24 Aug 2017, 11:10

So, what you are observing is likely a sample size issue. When you speak of "clustering" your standard errors, I assume you are referring to using vce(cluster county), etc. In using vce(cluster county), you are telling Stata that you actually only have 32 independent observations and that much of the data is, at least to some extent, redundant. The standard errors are then inflated to reflect that fact, and, in this case, they are inflated to the point that your results lose statistical significance. When you use the smaller district level, you have more clusters, hence more putatively independent observations, with less redundancy, so you get smaller standard errors and the results end up being statistically significant.
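
To see this inflation directly, one could fit the same model twice and put the standard errors side by side. The following is only a sketch: the variable names (migrant, famine, province, county) are assumptions for illustration and do not come from the original post.

```stata
* Hypothetical variable names: migrant (0/1 outcome), famine (continuous),
* province and county (numeric identifiers). Adjust to your own data.

* Default (model-based) standard errors
logit migrant c.famine i.province
estimates store plain

* Standard errors clustered on county (32 clusters)
logit migrant c.famine i.province, vce(cluster county)
estimates store cl_county

* The coefficients are identical; only the standard errors differ
estimates table plain cl_county, b(%9.4f) se(%9.4f)
```

The point estimates should match across the two columns, since vce(cluster ...) changes only how the variance of the estimates is computed.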

Unfortunately, the use of vce(cluster district) is misleading Stata into providing you with overly optimistic results. The cluster-robust vce requires that observations be independent across clusters, although they may be correlated within them. That would not hold at the district level: there is correlation among districts that lie within the same county.

But you have another issue here. Your data, as described, have a nested hierarchical structure which -logit- does not respect. You have, at least, three levels: individual observations within districts within counties. I would use -melogit- to model this data. The use of a one-level model with -logit- is likely to be seriously inefficient (unless the effects at the district and county levels are really minimal--in which case you also have no need of -vce(cluster...)-). By using a multi-level model you will absorb some part, perhaps a very substantial part, of the variance that -logit- cannot model, into the random intercepts, leaving you with smaller standard errors and more statistical power.
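
A minimal sketch of such a three-level model follows, assuming hypothetical variable names (migrant, famine, province, county, district) that are not taken from the original post. With around two million observations, estimation may take a long time.

```stata
* Hypothetical variable names; adjust to your own data.
* Random intercepts for counties, and for districts nested within counties.
* Levels are listed from the highest (county) to the lowest (district).
melogit migrant c.famine i.province || county: || district:

* Residual intraclass correlations implied by the estimated
* variance components at each level
estat icc
```

If the intraclass correlations turn out to be close to zero, that would suggest the county and district effects are minimal, which is the exception noted above.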
Ciara Lusin

Join Date: Jan 2017

Posts: 30
#3

25 Aug 2017, 06:15

Thanks Clyde! I will try it )))
Ciara Lusin

Join Date: Jan 2017

Posts: 30
#4

26 Aug 2017, 07:01

Hi Clyde, I have used melogit and I get statistical significance, but again, with clustering at the county level the significance disappears. What can be done in this case?
Clyde Schechter

Join Date: Apr 2014

Posts: 29981
#5

26 Aug 2017, 11:07

So, using -melogit- you have at least eliminated the inefficiency that results from disregarding the nested structure. But, in the end, because you have only 32 counties, and substantial redundancy in your data, the data simply do not support a sufficiently precise estimation of the model parameters for you to distinguish them from zero. The only thing you can do is get data on more counties. Or perhaps there is some other variable that is a better indicator of famine than the growth of the potato harvest. But with your current data, the results are what they are.
Ciara Lusin

Join Date: Jan 2017

Posts: 30
#6

26 Aug 2017, 12:27

Thanks for the response! I just realised I made a mistake - I missed calculating the marginal effects after using melogit. My dependent variable is a dummy, the measure of famine is continuous, and the other controls are categorical. What command would you suggest for marginal effects? I tried mfx but it doesn't work, and neither does mfx2... I get an error that my disk is full, but it is not.
Clyde Schechter

Join Date: Apr 2014

Posts: 29981
#7

26 Aug 2017, 12:31

The best command for this purpose is the official Stata command -margins-. In order to use it, however, you must have used factor-variable notation in your logistic regression. So if you didn't, go back and re-run the regression using factor-variable notation. If you are not familiar with that, read -help fvvarlist-. For an introduction to the -margins- command, I recommend Richard Williams' excellent article at http://www.stata-journal.com/sjpdf.h...iclenum=st0260. That will give you all you need to know about it for present purposes. If you want to go on and learn about the many other things that -margins- can do, read the -margins- chapter in the PDF manuals that came with your installation.
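
As a rough sketch of what that looks like, assuming hypothetical variable names (migrant, famine, province, county, district) that are not taken from the original post:

```stata
* Hypothetical variable names; adjust to your own data.
* c. marks a continuous variable and i. a categorical one, so that
* -margins- knows how to treat each predictor.
melogit migrant c.famine i.province || county: || district:

* Average marginal effect of famine on the predicted probability
margins, dydx(famine)

* Average marginal effects for all covariates in the model
margins, dydx(*)
```

Which prediction -margins- uses by default after -melogit-, and how the random effects are handled in it, depends on the Stata version, so it is worth checking -help melogit postestimation- before interpreting the numbers.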