  • Another error with lrtest after melogit


    Dear StataListers,

    I am posting a follow-up question to the one posted by Michael Evangelist on 11 June 2018 at 13:41, entitled "Error with lrtest after melogit". I have a similar problem, but with a different error message. The error occurs with "lrtest" as well as with the "lrtest, force" option recommended in the referenced post; it can be seen in the last two lines of output below, and it occurs with or without added independent variables.

    I'm using Stata/SE 15.1 updated as of today (3/14/2019) for Windows 64-bit.

    Can anyone help me understand how to get the -lrtest- command to execute for this model, or why it is not currently executing? Much of what I have read indicates that this test should appear by default in the -melogit- output.


    Code:
    . melogit readmit30  cohort4num || new_hosp_num:, vce(robust) or
    
    (iteration log omitted)
    
    Mixed-effects logistic regression               Number of obs     =     33,619
    Group variable:    new_hosp_num                 Number of groups  =      2,192
    
                                                    Obs per group:
                                                                  min =          1
                                                                  avg =       15.3
                                                                  max =        195
    
    Integration method: mvaghermite                 Integration pts.  =          7
    
     Log pseudolikelihood = -15758.173               Wald chi2(1)      =     221.17
                                                     Prob > chi2       =     0.0000
     
                               (Std. Err. adjusted for 2,192 clusters in new_hosp_num)
     ------------------------------------------------------------------------------
                  |               Robust
        readmit30 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
       cohort4num |   .5597914   .0218392   -14.87   0.000      .518583    .6042745
            _cons |   .2436699   .0038726   -88.84   0.000     .2361967    .2513795
     -------------+----------------------------------------------------------------
     new_hosp_num |
       var(_cons) |   .0145663   .0105003                      .0035461    .0598338
     ------------------------------------------------------------------------------
    Note: Estimates are transformed only in the first equation.
    Note: _cons estimates baseline odds (conditional on zero random effects).
    
    .
    . lrtest
    In the old syntax, the unrestricted model defaulted to a model saved under the name 0.  This model was not found.
    r(198);

    Many thanks for your insight.
    Last edited by Dana Fletcher; 14 Mar 2019, 15:41.

  • #2
    Well, -lrtest- does a likelihood ratio test contrasting two models, one of which is nested in the other. You need to store the estimates for those models, and then the -lrtest- command must be given the names of the stored estimates. You can omit one of those names, and Stata will, by default, use the currently active estimation results for that one. But you haven't specified anything at all, so Stata has no idea which models you want to compare.

    Let me speculate that what you want to do is a likelihood ratio test for the variable cohort4num. To do that, you must run your model twice, once with cohort4num in the model, and once without, and then do the -lrtest- comparing those. It is also important to make sure that you use the same estimation sample for both. Since adding variables risks having additional observations deleted from estimation due to missing values on the added variables, the simplest approach is to run the full model first, then repeat the model on the resulting estimation sample. All in all, it would look something like this:

    Code:
    melogit readmit30 cohort4num || new_hosp_num:, vce(robust) or
    estimates store model_with
    melogit readmit30 if e(sample) || new_hosp_num:, vce(robust) or
    * note: with vce(robust) these are pseudolikelihoods, so -lrtest- will
    * object unless you add the force option; for a true LR test, refit
    * both models without vce(robust)
    lrtest model_with



    • #3
      Thanks so much for that clarification!

      I saw the estimates store syntax, but wasn't sure exactly how to put it into play or how much Stata did behind the scenes versus what I needed to tell it to do, so seeing it is very helpful.

      Also thank you for commenting on the order of running the two models. I thought I had imputed all my missing values, but apparently there are a few (~30) that I missed and will need to go find--without your comment, I may not have looked for that.

      A couple confirmatory questions:
      a. I read your code as "e(sample)" refers to the sample designated by "estimates store model_with". Is that correct?

      b. I have been under the impression that I would want to compare what I am calling the unadjusted model (melogit y x) to the adjusted model (melogit y x a b c) so to compare the relationship of y and x with and without other covariates. So, with that understanding, if the full model I want to estimate looks like the one below, I would want to run the LR test as below. Am I understanding this as an appropriate way to apply the LR test? Under what conditions would I want to compare just y to y x a b c?

      Code:
      melogit readmit30   cohort4num  $patient $hospital $market $hrrp $month $year  || new_hosp_num:, vce(robust) or
      estimates store model_with
      melogit readmit30 cohort4num if e(sample) || new_hosp_num:, vce(robust) or
      And last, the reason I'm trying to do the likelihood ratio test is because when I add hospital characteristics for number of beds, ownership, census region, teaching hospital, and a few others, the unobserved random variation between hospitals drops from around 0.003 to around 10^-33. I'm trying to understand and characterize that output a bit better, because I am not convinced that those factors should account for most of the unobserved random variation between hospitals (I was expecting higher variance). I'm not trying to justify using a single level method, but I also am not sure whether I'm missing something (and as a graduate student, I think this might be something I should understand a little bit). I am doing a number of sensitivity and specificity tests, but can you recommend other appropriate fit tests for melogit that I should be considering?

      Thank you!!



      • #4
        a. I read your code as "e(sample)" refers to the sample designated by "estimates store model_with". Is that correct?
        Well, it does, but only coincidentally. What e(sample) refers to is the last estimation command you have run. Even if you never stored those results, you could still refer to the estimation sample of the last estimation using e(sample).

        So, with that understanding, if the full model I want to estimate looks like the one below, I would want to run the LR test as below. Am I understanding this as an appropriate way to apply the LR test?
        Yes.

        Under what conditions would I want to compare just y to y x a b c?
        You would do that to get an omnibus (likelihood ratio) test of the joint significance of x, a, b, and c.
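        For concreteness, a sketch of that omnibus comparison (schematic variable names from the question above, following the pattern in #2; fit without vce(robust) so the log likelihoods are true likelihoods):

        Code:
        melogit y x a b c || new_hosp_num:
        estimates store full
        melogit y if e(sample) || new_hosp_num:
        lrtest full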

        And last, the reason I'm trying to do the likelihood ratio test is because when I add hospital characteristics for number of beds, ownership, census region, teaching hospital, and a few others, the unobserved random variation between hospitals drops from around 0.003 to around 10^-33. I'm trying to understand and characterize that output a bit better, because I am not convinced that those factors should account for most of the unobserved random variation between hospitals (I was expecting higher variance).
        Well, one's intuitions about which factors are or are not sufficient to account for the unobserved variation at some level in a model without covariates are not necessarily reliable. Sometimes simple structural factors can account for more of an outcome than seems intuitively right. Nevertheless, it never hurts to view your model skeptically and scrutinize it for possible errors. In all honesty, I'm not sure how a likelihood ratio test really helps you in that regard. Be that as it may, if you have doubts about the results you are getting, my first advice would be to double-check the data: are the distributions of all the model variables plausible/correct? What about joint distributions of pairs of variables? I'd be especially worried about faulty data when there has been an imputation process--depending on how the imputations were done there is a lot that can go wrong there.

        Another thing is to check that your global macros aren't messing things up. They are an inherently unsafe programming practice and should rarely be used, and this is not really a suitable situation for using them: if a global macro with the same name is defined in any other program that runs, the two will clash, and one can silently overwrite the other. You won't always know what other programs might be running, or have run recently. So before running the regression, I would -display- the globals to make sure they still contain what you put into them and have not been surreptitiously changed. In the future, use local macros to hold lists of variables like this--they're not subject to this problem.
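        For example (a sketch, not your actual variable lists):

        Code:
        // verify the globals before relying on them
        display "$patient"
        display "$hospital"
        
        // safer: a local macro, private to the do-file run that defines it
        local patient "i.race i.female agecat"
        melogit readmit30 cohort4num `patient' || new_hosp_num:, vce(robust) or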

        can you recommend other appropriate fit tests for melogit that I should be considering?
        Yes, in addition to discrimination (sensitivity and specificity) look at calibration. In ordinary -logit- you can do the Hosmer-Lemeshow statistic. That isn't directly available after -melogit-, but you can "roll your own" with something like this:

        Code:
        melogit readmit30 cohort4num $patient $hospital $market $hrrp $month $year || new_hosp_num:, vce(robust) or
        predict readmit30_hat if e(sample), mu
        
        // CALCULATE DECILES OF PREDICTED RISK
        xtile decile = readmit30_hat, nq(10)
        
        // COLLAPSE OVER DECILES
        collapse (sum) readmit30 readmit30_hat, by(decile)
        
        // COMPARE # OF OUTCOMES PREDICTED & OBSERVED IN EACH DECILE
        gen difference = readmit30 - readmit30_hat
        list, noobs clean
        graph twoway scatter readmit30 readmit30_hat, sort || line readmit30_hat readmit30_hat
        Note: There is nothing magical about using deciles here--it's convenient for moderate size samples. But if your sample is large, feel free to break it up into 20 or 100 or 1000, or any n, quantiles. As long as you have around 50 observations per quantile, this will give you reasonable fidelity.

        In the graph produced by the above code, perfect calibration would have all of the scatter points falling exactly on the line. That pretty much never happens in real life, but the extent of departure from that gives you a sense of how close to or far from perfect your model is. It's also useful for spotting things like a pattern of overprediction in the middle and underprediction at the extremes (or vice versa) or things like that which might provide a clue about how to improve the model.


        Last edited by Clyde Schechter; 14 Mar 2019, 23:44.



        • #5
          Thank you, Dr. Schechter,

          In all honesty, I'm not sure how a likelihood ratio test really helps you in that regard.
          At this stage in my learning, I think there is some of the "throw the kitchen sink at it" to see what makes sense and is useful, while still being careful to stick to my a priori assumptions and strategies. After having read more about the lrtest and your comments, I think you are correct, but I had wanted to explore it so I could be somewhat conversant.

          I'd be especially worried about faulty data when there has been an imputation process--depending on how the imputations were done there is a lot that can go wrong there
          Most of my variables had missingness of 0.5% or less and only one at 2.9%, all in hospital and market characteristics. I understand that one rule of thumb is that if missingness exceeds 5% it should be addressed. However, it was suggested to me that it would be good for me to learn how to impute missing values, so I applied it here even though I had lower rates of missing values.

          First, I filled in as much as I could from the same hospital with a neighboring year of data. Then I used the mi impute mvn commands in wide form to impute 10 estimates, took the average of those 10 estimates, filled in only the remaining missing values with the averages, and then deleted the individual imputed estimates. For non-continuous variables, I rounded the value to the nearest whole number using 6 decimal places.

          So, I guess the question (beyond whether this approach was appropriate) is whether small percentages of potentially faulty imputed data pose a higher risk to the analysis than do missing data? My guess is that the short answer is, "it depends."

          Another thing is to check that your global macros aren't messing things up. They are an inherently unsafe programming practice and should rarely be used
          I did not know this. Thank you. I think the way I have been applying them is more like a local macro---I apply them within each .do file separately and just before running the regressions-- but your point is well taken. I do run multiple windows of Stata at the same time, so I will switch to using local macros instead as a shortcut to store variables.

          Yes, in addition to discrimination (sensitivity and specificity) look at calibration. In ordinary -logit- you can do the Hosmer-Lemeshow statistic. That isn't directly available after -melogit-, but you can "roll your own" with something like this: ...
          Thanks for this suggestion! I will try this calibration and do a double-check on my distributions in the next couple days and let you know what I see.
          Last edited by Dana Fletcher; 15 Mar 2019, 13:33.



          • #6
            Missing data is always messy to deal with. There really aren't good solutions in most situations. While I have seen the rule of thumb you refer to, it is also true that very small amounts of missing data can meaningfully result in biased analyses. I will note that your procedure of copying values from the same hospital in nearby years is going to reduce within-hospital variation artificially. This will lead to inflation of coefficient estimates in regression models. The same is true of the procedure of generating 10 multiply imputed values and then just using the average of the 10--that violates the spirit of multiple imputation, and does so in a way that actually completely defeats its purpose. I can understand that one might be eager to avoid the difficulties that come with doing multiple imputation for real--but if that is the case, then I think it better to just abandon it altogether and use some other form of single imputation, or go to complete case analysis if you think the fraction of missing data is small enough and the missingness sufficiently close to uninformative. If your reluctance to use real multiple imputation is that you find the assumption of missingness at random untenable for your data, then I commend you for taking this assumption seriously--but in that case, averaging the results of 10 imputations doesn't get you around the problem!



            • #7
              If your reluctance to use real multiple imputation is that you find the assumption of missingness at random untenable for your data, then I commend you for taking this assumption seriously--but in that case, averaging the results of 10 imputations doesn't get you around the problem!
              I wish I could say that I have done something commendable, but alas, I did make the assumption that missing at random applied to at least some of my missing data. My rationale for proceeding the way I did was based on the assumption that some imputation was better than none, and my procedure was based on what I thought was a pragmatic approach of dipping my toe in the waters of mi impute, without taking what seemed to be the extensive time needed to fully understand it. I see now that I actually did some kind of hybridized single imputation, but from only 10 values, which could actually be worse than a single imputation!

              There were a few variables for which certain hospitals were missing data for an entire year--primarily variables available only from the American Hospital Association survey and not from CMS data. It was that scenario that prompted me to use the neighboring year for those variables at those hospitals.

              It seems I will need to revisit this question. Thank you for taking the time to explain it more thoroughly!



              • #8
                @ClydeSchechter --Working through the above, but this exercise has introduced me to a few things about running models that I wasn't aware of before. Coincidentally, I have now learned more about -maximize- and what all this "not concave" and "backed up" stuff means... Previously I just looked for non-convergence and didn't even notice whether the last iteration had one of these messages. So, thanks for opening that door.

                Some newbie observations about the LR test and local macros--

                First, I usually run the 2 models used in the LR test separately and scrutinize the results. After doing that, it seems like the LR test is a way to shorten the final output (without all those scrutinized middle pieces) and--I'm not sure this description is totally accurate--a way to summarize the difference between the two (I am not using it to assess random intercept/slope distinctions). But, like you wrote above, it doesn't answer the question about my hospital random error going essentially to zero--that answer is forthcoming.

                Second, I was initially irritated that the local macros wouldn't let me run each new segment of code using Ctrl-D without including them... I've adjusted by commenting out my prior lines each time while I'm building the code, which works and which I'm getting used to. Are there any other hints about how to do this efficiently while I'm building code a section at a time?

                Thanks!



                • #9
                  Second, I was initially irritated that the local macros wouldn't let me run each new segment of code using Ctrl-D without including them... I've adjusted by commenting out my prior lines each time while I'm building the code, which works and which I'm getting used to. Are there any other hints about how to do this efficiently while I'm building code a section at a time?
                  This is a common concern about using local macros, and the solution you have found is the one I use. You do have to be a little bit careful with it: if the code you comment out changes the value of the local macro, then this approach could lead to incorrect results. In that case you would have to re-define the local macro, setting it to an appropriate value, before the new code section. Fortunately, this doesn't happen all that often in practice. Most local macros that you initialize directly are constants (or unchanging lists of variables or values) you want to refer to throughout the code. The macros that change during execution are often under the control of a -foreach- or -forvalues- loop so you don't have to attend to them. There are exceptions to these generalizations, of course, but, fortunately, they are not frequent.

                  Another thing you can do with local macros that are just holding constants (including unchanging lists of variables) is just copy the original definition and paste it in before each section of code that uses it. This may be more convenient than commenting out code that comes between. It is minimally wasteful of compute cycles to do this, but not enough to be noticeable and it can save a lot of programmer time, which is more valuable. This also has the benefit of making a read-down of the code easier because you don't have to search far up the code to find the definitions. The downside is that if you subsequently change the definition, you must be careful to change it in every instance. (That's enough of a downside that I seldom use this method.)
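                  A minimal sketch of that copy-and-paste pattern (hypothetical macro name and variable list):

                  Code:
                  * section 1
                  local covars "i.race i.female agecat"
                  melogit readmit30 `covars' || new_hosp_num:, or
                  
                  * section 2 -- repeat the definition so this section can be run by itself
                  local covars "i.race i.female agecat"
                  melogit readmit30 cohort4num `covars' || new_hosp_num:, or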



                  • #10
                    Ok, here goes.... many answers as a result of the prior suggestions, but also more questions!


                    Another thing you can do with local macros that are just holding constants (including unchanging lists of variables) is just copy the original definition and paste it in before each section of code that uses it
                    I did try that as well and didn't like it as much--I slightly changed one of the variable names and then had to remember to change the local macro again in multiple places. It could serve well, but for my current purposes, I like the comment out option.


                    In the graph produced by the above code, perfect calibration would have all of the scatter points falling exactly on the line. That pretty much never happens in real life, but the extent of departure from that gives you a sense of how close to or far from perfect your model is. It's also useful for spotting things like a pattern of overprediction in the middle and underprediction at the extremes (or vice versa) or things like that which might provide a clue about how to improve the model.
                    I am still working on the multiple imputation--see the next question below--but the calibration code worked beautifully. I've run it with my original hybrid imputation, with all missings unfilled, and here, where I've been able to clean my data further such that I have only 2 variables remaining with any missings: whether a hospital has rehab onsite (2.49% missing) and whether a hospital is part of a larger system (0.66% missing). Both are from the AHA survey database, and I don't have any other sources to cross-reference the missing fields. When I used my original hybrid imputation strategy, the dot on the far left wasn't there (b/c there were no missings). Other than that, all the plots look like this, so I'm wondering if this means that the missingness will not be all that informative.
                    [Attachment: model3_calibrate.png -- calibration plot]

                    I can understand that one might be eager to avoid the difficulties that come with doing multiple imputation for real--but if that is the case, then I think it better to just abandon it altogether and use some other form of single imputation, or go to complete case analysis if you think the fraction of missing data is small enough and the missingness sufficiently close to uninformative.
                    That brings me to the difficulties you mention above relative to multiple imputation. I must be missing something fundamental, but I cannot figure out how to get from the mi impute command to getting those imputations into my dataset in a manner that I can then run my melogit models.

                    I have done the following with both chained and mvn. None of the other variables correlates with the two missing ones at more than 0.4.
                    Code:
                    * IMPUTE MISSING VALUES FOR ONSITE AND HOSPITAL_SYSTEM
                    mi set wide
                    mi misstable summarize onsite hospital_system
                    mi misstable patterns onsite hospital_system
                    
                    pwcorr cohort4num readmit30 pct_hosp_rate race female agecat ///
                           event_age10 cci indication esrd disabled bedsize ///
                           ownership specialty rural university region onsite ///
                           pct_medicare_share uninsured map HHI penalty hospital_system, obs
                    
                    mi register imputed onsite hospital_system
                    
                    mi xtset, clear
                    mi impute chained (logit) onsite (logit) hospital_system, add(10) rseed(53421)
                    I initially thought that I could get a single number to fill into the missing slots, but am now wondering if I'm really off base, because the instructions I have seen just say to run -mi estimate: regress- after this step. I did this, but it still wasn't clear to me how to get to that completed list of my two variables to run in -melogit-. Could you shed any light on this gap in my understanding?

                    ...my first advice would be to double-check the data: are the distributions of all the model variables plausible/correct? What about joint distributions of pairs of variables?
                    I've narrowed it down to census region. I can add any of the other variables and the variance of the hospital random intercept stays around 0.001. When I add the census region, it jumps down to around 10^-33.

                    Thanks!!



                    • #11
                      When I used my original hybrid imputation strategy, the dot on the far left wasn't there (b/c there were no missings). Other than that, all the plots look like this, so I'm wondering if this means that the missingness will not be all that informative.
                      I don't understand how the restoration of missing values has resulted in that phenomenon. That point on the calibration plot says that there are plenty of readmissions actually taking place among people whom you predict will not be readmitted. So apparently your earlier imputation strategy more correctly predicted them. Fair enough. But then when you stepped back from the earlier imputation strategy, what did you do instead? Did you just drop the cases with missing values from estimation? Or did you impute them in some other way? If so, how? I don't get what analysis the graph you are showing represents.

                      I initially thought that I could get a single number to fill into the missing slots, but am now wondering if I'm really off base, because the instructions I have seen just say to run mi estimate: regress after this step.
                      As the name multiple imputation suggests, it does not provide a single number to use in place of a missing value. Multiple imputation creates multiple data sets, each with its own substitutes for the missing values. That is in turn followed by carrying out the desired analysis on each of those multiple data sets, and then combining the results using what have come to be known as Rubin's rules. This analysis/combining procedure is what -mi estimate- does. But in your case, it should not be -mi estimate: regress-. It should be -mi estimate: melogit ...- (fill in the rest of your -melogit- command). Stata will respond by doing 10 -melogit-s, one on each of your 10 added data sets, and then it will show you the combined results as a single set of outputs. You can then use that output the same way you would as if it just came from a single -melogit- on a complete data set.
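                      Concretely, something like this (covariate list abbreviated for illustration; the -or- option on -mi estimate- asks for pooled odds ratios):

                      Code:
                      mi estimate, or: melogit readmit30 cohort4num onsite hospital_system || new_hosp_num: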

                      Given the very small amount of missing data you have, I think that if your multiple imputation results come out pretty close to the results you get from a complete cases only analysis, you can conclude that missingness is unlikely to have been informative.

                      I've narrowed it down to census region. I can add any of the other variables and the variance of the hospital random intercept stays around 0.001. When I add the census region, it jumps down to around 10^-33.
                      That surprises me. It implies that if you specify the census region (in addition to your other hospital descriptors) you don't really need to know anything more about the hospital to predict the readmission rate. Census regions are very large geographic areas that contain thousands of hospitals. I would think that there would be a lot of remaining between-hospital variation. Perhaps this is arising from the richness of your other hospital descriptors. Maybe you have so many other hospital level variables that those variables plus census region almost uniquely identify hospitals? What happens if you retain census region in the model but drop a few of the other hospital-level variables?



                      • #12
                        That surprises me. It implies that if you specify the census region (in addition to your other hospital descriptors) you don't really need to know anything more about the hospital to predict the readmission rate. Census regions are very large geographic areas that contain thousands of hospitals. I would think that there would be a lot of remaining between-hospital variation. Perhaps this is arising from the richness of your other hospital descriptors. Maybe you have so many other hospital level variables that those variables plus census region almost uniquely identify hospitals? What happens if you retain census region in the model but drop a few of the other hospital-level variables?
                        Yes, this is exactly the reason I have been skeptical (although perhaps I am overthinking it). What started me down this path was that I had been trying to run average marginal effects (margins, dydx(*)), but it kept giving an error suggesting the model was too complex for the data. So, I removed and added back variables to try to identify the offending one(s). I ultimately realized that it wasn't WHICH variable, but HOW MANY. So, I switched to an output of odds ratios, because I could still interpret them and they ran without error rather than dropping any of the non-significant explanatory factors. But I remained skeptical, noted the change in unobserved variation, and decided to investigate further. So, I did a similar thing with the variation, though perhaps not as granular, and came to the conclusion that census region was explaining most of it. However, I just now redid the exercise of removing and adding back variables to determine whether it's an individual variable, a group of them, or simply how many I include.

                        In this model, I am evaluating the association of my explanatory variables with readmission, not the added effect of my treatment variables of interest (cohort4num and hospital_rate). These are estimated as complete cases, so 176 records (out of ~33k) are excluded.

                        Code:
                        *****************************************************************************
                        
                        local patient "i.race i.female ib(freq).age c.cc i.indication i.esrd i.disabled "
                        
                        local hospital "c.bed_cnt100 ib(freq).ownership i.university ib(freq).region i.onsite i.system  i.specialty i.rural "
                        
                        local market " c.uninsured c.medicare_share c.map c.HHI"
                        
                        local hrrp "i.penalty "
                        
                        local season  "i.February i.March i.April i.May i.June i.July i.August i.September i.October i.November i.December"
                        
                        local year  "i.yr2013 i.yr2014"
                        
                        ********************************************************************************
                        melogit readmit30 `patient' `hospital' `market' `hrrp'  `season' `year'|| new_hosp_num:, vce(robust) or

                        Region, penalty, and university have consistently been the best hospital-level predictors of readmission in this model. If I remove those three variables, the variance component changes from a magnitude of 10^-33 to 0.001. Removing any of the other variables does not produce this change, so these three seem to be the important ones in explaining between-hospital variation in this model. If I add any one of them back individually, the variance remains at the 0.001 magnitude; however, if I add them back in any combination of two, the variance drops close to zero again.
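                        A sketch of that remove-and-add-back check (the stored parameter name follows -melogit-'s default equation naming; run the model with the coeflegend option to confirm the exact name on your setup):

                        Code:
                        * Baseline: all hospital-level variables included.
                        melogit readmit30 `patient' `hospital' `market' `hrrp' `season' `year' ///
                            || new_hosp_num:, vce(robust) or
                        display _b[/var(_cons[new_hosp_num])]   // near zero (~10^-33)
                        
                        * Drop region, penalty, and university (penalty is the `hrrp' local).
                        local hospital2 "c.bed_cnt100 ib(freq).ownership i.onsite i.system i.specialty i.rural"
                        melogit readmit30 `patient' `hospital2' `market' `season' `year' ///
                            || new_hosp_num:, vce(robust) or
                        display _b[/var(_cons[new_hosp_num])]   // rises to roughly 0.001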

                        So now, a long way from my original post, I find myself wondering which is more important to consider: parsimony (simplifying my model by excluding some explanatory variables that I had originally included based on my conceptual model and data availability, but that have not been significant in this model, like all of the market variables, specialty, and system); explaining as much of the between-hospital variation as I can; or some combination of the two strategies.

                        In this case, I am trying to control for the unobserved random between-hospital variation, not to study it per se, so my sense is that it is OK to leave the model as I originally had it and simply note that the model explains most of the unobserved between-hospital variation. Is that an appropriate conclusion from this exercise?
                        Last edited by Dana Fletcher; 20 Mar 2019, 12:37.



                        • #13
                          Well, the question of what to include in a model is always a difficult one.

                          The first question is what your goal is. If you want a model that tightly fits the data, then you tend to include more variables. Of course, if you include too many, you can overfit the noise, and then your model will fall apart when applied to a different data set. There is a rule of thumb about wanting more than 50 observations for each variable (a categorical variable with n levels counts as n-1 variables), but it is a little harder to apply in a multi-level model. For things like person-level variables, you probably want a ratio of 50 people per variable. So that is a caution about including too many variables, but in large enough data sets this is not the worry.

                          But if you want a model that sharpens your understanding of the real-world data generating process, then parsimony may be more important. Here there are no statistical rules of thumb.

                          Some investigators like to trade off parsimony with fit by looking at the AIC or BIC (run -estat ic- after each model you want to compare).
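                          A sketch of that AIC/BIC comparison (model terms abbreviated; note that with vce(robust) the reported log pseudolikelihood is not a true likelihood, so the models are fit without that option here):

                          Code:
                          * Fit the fuller model and store it.
                          melogit readmit30 `patient' `hospital' `market' || new_hosp_num:
                          estat ic
                          estimates store full
                          
                          * Fit a more parsimonious model and store it.
                          melogit readmit30 `patient' `hospital' || new_hosp_num:
                          estat ic
                          estimates store reduced
                          
                          * Side-by-side comparison of AIC and BIC:
                          estimates stats full reduced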

                          The one approach I really advise against is including variables based on their statistical significance.



                          • #14
                            There is a rule of thumb about wanting more than 50 observations for each variable (a categorical variable with n levels counts as n-1 variables), but it is a little harder to apply in a multi-level model.
                            Yes, I don't think I have an issue with this. Thank you for providing that rule of thumb and its caveats!

                            The one approach I really advise against is including variables based on their statistical significance.
                            Exactly the reason I have been hesitant to remove variables. There are three that I think are less important in my conceptual model: hospital system, rural, and Medicare share of hospital discharges. However, since I have now seen that they are also non-significant, my rationale for excluding them is no longer unbiased. I guess the lesson here is how important it is to start from the most parsimonious conceptual model, i.e., if you don't have a really good reason for including a variable a priori, don't include it, rather than including it because it might be useful, or because it's available when other variables, perhaps of higher conceptual significance, are not (more of the kitchen-sink problem).

                            I will play with estat ic.

                            Thanks!



                            • #15
                              But in your case, it should not be -mi estimate: regress- It should be -mi estimate: melogit ...- (fill in the rest of your -melogit- command). Stata will respond by doing 10 -melogit-s, one on each of your 10 added data sets, and then it will show you the combined results as a single set of outputs. You can then use that output the same way you would as if it just came from a single -melogit- on a complete data set.

                              Given the very small amount of missing data you have, I think that if your multiple imputation results come out pretty close to the results you get from a complete-cases-only analysis, you can conclude that missingness is unlikely to have been informative.
                              Aha! This is where I got stuck before, thank you! I did try -mi estimate: melogit-, but Stata told me it was not a valid command. -mi estimate: meqrlogit- does run, although it doesn't appear to be compatible with vce(robust) and it takes several hours. It also appears to be finicky with running the calibration test afterward. I tried rerunning -melogit- after -mi estimate-, but Stata said the data are mi set. I tried -mi fvset- and -mi export-. None of those would allow me to predict the residuals to set up for the graph. However, the sign, magnitude, and significance of the coefficients align with my prior models that were not actually multiply imputed. So I think I will go with complete cases or my [incorrect] prior hybrid method, since there isn't much information to gain from the multiple imputations. This has been a very helpful exercise in getting a rudimentary understanding of multiple imputation for missings. Thank you very much!!
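                              For completeness, the imputation-based fit I got to run was along these lines (a sketch using the covariate locals from #12; as noted above, -meqrlogit- did not accept vce(robust) in my session):

                              Code:
                              * Estimate the multilevel logit on each of the 10 imputed data
                              * sets and combine the results with Rubin's rules.
                              mi estimate, dots: meqrlogit readmit30 `patient' `hospital' ///
                                  `market' `hrrp' `season' `year' || new_hosp_num: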

                              I don't understand how the restoration of missing values has resulted in that phenomenon. That point on the calibration plot says that there are plenty of readmissions actually taking place among people whom you predict will not be readmitted. So apparently your earlier imputation strategy more correctly predicted them. Fair enough. But then when you stepped back from the earlier imputation strategy, what did you do instead? Did you just drop the cases with missing values from estimation? Or did you impute them in some other way? If so, how? I don't get what analysis the graph you are showing represents.

                              In the graph I attached previously in #10, when I stepped back I did a couple of things. First, after reading something about doing as much data cleaning as possible before imputing, I cross-referenced my missings with another dataset and found that missing values in 3 of the variables were due to a single county. I averaged the data from the two nearest adjacent counties by year and filled in the data for those variables. I could infer a couple of other missing variables from the data (e.g., where I was missing reason for entitlement but had age). Then, there were several hospitals for which I didn't have data on two key exclusion criteria and could not find these data in other sources. Because they were key exclusions relative to which hospitals should be in my sample, I decided it might be more accurate to exclude them than to impute wrongly (as it turns out for this dataset, it probably didn't make much difference).

                              After that, I ran my models with the remaining missings intact (two variables with 0.66% and 2.5% missing) and ran -melogit- in its default (complete case) fashion. The graphs below were the outcome. On the left (or top, depending on presentation) is the model with only my dependent and key intervention variables. On the right (or bottom) is the fully adjusted model with all the explanatory variables described in #12 above. I was pleased that there was a difference!
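                              For anyone following along, one common way to build a calibration plot like the ones attached, right after fitting the model (a sketch; decile grouping is one conventional choice, and predict's mu option includes the empirical Bayes random effects):

                              Code:
                              * Predicted probability of readmission for each observation.
                              predict phat, mu
                              
                              * Group into deciles of predicted risk, then plot observed
                              * readmission rates against mean predicted probabilities.
                              preserve
                              xtile risk_decile = phat, nq(10)
                              collapse (mean) observed=readmit30 predicted=phat, by(risk_decile)
                              twoway (scatter observed predicted) (function y=x, range(0 1)), ///
                                  ytitle("Observed readmission rate") ///
                                  xtitle("Predicted probability")
                              restore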


                              [Attachment: unadjusted_a3_m3_calibrate.png]
                              [Attachment: model3_calibrate.png]

                              This has been super helpful, and thank you so much for your patience and thorough counsel!
