
  • Using the "VARNAME N" feature in GLM family(binomial) estimator

    Dear Stata Users,
    This is my first post so I apologize in advance if I missed anything in the protocol for posting.

    I am estimating a binary patient-level outcome variable CR in an interrupted time series that will be interpreted as a hospital-level effect. I use an indicator variable for policy implementation, HRRP, with patient data nested within hospitals (patient i at hospital j), and the following Stata code:

    glm cr i.hrrp c.elapsed_qtr i.hrrpXtime2 $season $patient $hospital $market, vce(cl new_hosp_num) family(binomial binomial_denominator_variable) link(logit)

    Here elapsed_qtr and the interaction term capture the post-period time series; $patient contains patient-level characteristics; $hospital and $market contain hospital-level and market-level characteristics; and new_hosp_num is the identifier for hospital j.

    My post is about this binomial_denominator_variable.

    The Stata manual says this about the binomial_denominator_variable, which it calls varnameN:

    "The binomial distribution can be specified as 1) family(binomial), 2) family(binomial #N), or 3) family(binomial varnameN). In case 2, #N is the value of the binomial denominator N, the number of trials. Specifying family(binomial 1) is the same as specifying family(binomial). In case 3, varnameN is the variable containing the binomial denominator, allowing the number of trials to vary across observations."

    From two earlier posts, including one by Clyde Schechter on 25 May 2016 ("GLM and blogit for proportion variable:different results") and one by Nick Cox on 19 Aug 2014 ("Entropy measure DV in panel data: Best regression technique?"), I infer that in case 3, varnameN should be the denominator of the dependent variable.

    First, operationally:

    1. What is Stata actually doing when varnameN is included in this manner? For example, is it taking the average across each varnameN group?

    Second, in my example:

    2. I have the numerator CR{1}, the denominator CR{0,1} for the total number of patients, and the frequency of CR at each hospital (so I don't need to use fracreg or betareg, which apply when the denominator isn't available). I would like Stata to take the hospital clusters into account in my output. Note that I am including standard errors clustered at the hospital level.

    3. To use this function correctly, would I use:
    varnameN=total number of hospitals in the cohort over all time periods?
    varnameN=total number of patients at each hospital over all time periods?
    varnameN=total number of hospitals for each quarter?
    varnameN=total number of patients at each hospital for each quarter?

    (some of the results change substantially depending on which is used, mostly for the time series and hospital-level variables)

    4. Would it be more appropriate to leave varnameN blank and interpret the final patient-level coefficients for a given hospital with a certain number of patients?

    5. I have not tried a mixed effects approach yet, as it was not something my advisers were keen on, but would that be a useful approach given the multi-level nature of the data?

    Thank you!
    Last edited by Dana Fletcher; 03 Oct 2018, 21:34. Reason: accidentally hit tab (or something), which inadvertently posted prematurely.

  • #2
    When you estimate a model like -glm success_count predictor_variables, link(log) family(binomial opportunity_count)-, Stata estimates all the model parameters by maximum likelihood. In this particular model, the likelihood function is based on the binomial probability of success_count successes in opportunity_count trials, conditional on the predicted probability (whose logarithm is the linear combination of the predictor_variables). I don't really know what you mean by taking an average across each varnameN group, so I can't say if this is that or not. I can say that were you to expand each observation into opportunity_count copies, of which success_count had a 1 outcome and the rest had a 0 outcome, and then ran -glm ..., link(logit) family(binomial)-, you would get exactly the same results (but it would take a lot longer!).
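    To see why the expanded and aggregated fits agree: the two log-likelihoods differ only by the log binomial coefficient, which does not depend on the predicted probability and so does not affect the maximization. A minimal Python sketch of that arithmetic (purely illustrative; this is not Stata's implementation):

```python
from math import comb, log

def bernoulli_loglik(k, n, p):
    # log-likelihood of the expanded data: k ones and (n - k) zeros
    return k * log(p) + (n - k) * log(1 - p)

def binomial_loglik(k, n, p):
    # log-likelihood of one aggregated observation with denominator n
    return log(comb(n, k)) + bernoulli_loglik(k, n, p)

# The difference is log C(n, k) for every p -- a constant that does not
# involve p, so both forms yield the same maximum-likelihood estimates.
k, n = 7, 20
diffs = [binomial_loglik(k, n, p) - bernoulli_loglik(k, n, p)
         for p in (0.2, 0.35, 0.5)]
```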

    I would definitely look into doing this with a mixed effects model. In fact, before I encountered your last question, I kept asking myself "why isn't she using -meglm- for this?" I don't know why your supervisors are reluctant to go down that path--can they explain it? It seems far and away the most appropriate way to model this naturally nested data.

    The choice of which variable to use as the denominator is a matter of what unit presents an opportunity for a success (positive outcome). Think about it this way. Suppose you were just doing -glm, family(binomial)- on a data set without any grouping or aggregation of the data: you just have one observation for every event in the data set. You can aggregate the data by grouping together all observations that have identical values on all of the model variables (except the outcome). Then you can replace all of the observations in that group with a single new observation that contains those covariate values and also contains two new variables: one is the count of the number of observations in that group and serves as your denominator, and the other is the number of such observations that had a positive outcome in the group. In fact, you can create this aggregated data set with -collapse-.

    // glm y x1 x2 x3 x4, link(log) family(binomial)
    // ASSUMPTION: y is coded 0/1
    collapse (count) denominator = y (sum) numerator = y, by(x1 x2 x3 x4)
    glm numerator x1 x2 x3 x4, link(log) family(binomial denominator)
    Both will give you the same results. The second one will run faster if the number of distinct combinations of x1, x2, x3, and x4 is substantially smaller than the number of observations in the original data set.
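    For a toy illustration of what that -collapse- step produces, here is the same grouping done in Python (the covariate patterns are made up; this only shows the count/sum construction, not a Stata command):

```python
# Toy patient-level rows: (x1, x2, y) with y coded 0/1.
rows = [(0, 1, 1), (0, 1, 0), (0, 1, 1),
        (1, 0, 0), (1, 0, 1)]

# Group by covariate pattern, as -collapse ..., by(x1 x2)- would:
# the count per group is the denominator, the sum of y the numerator.
groups = {}
for x1, x2, y in rows:
    n, k = groups.get((x1, x2), (0, 0))
    groups[(x1, x2)] = (n + 1, k + y)  # (denominator, numerator)

# groups[(0, 1)] is (3, 2): three observations, two with y = 1
```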

    This also should make it clear to you what the denominator is.

    Finally, I don't understand question 4.


    • #3
      Thank you for that explanation!

      1. The example of taking the average in my first question was just an example and perhaps not even relevant to the question. I will try to work out the logic of how maximum likelihood would be programmed manually to understand what it is doing behind the scenes.

      2. I think I see how to use the denominator function now.

      The way I was using it was -glm y x1 x2 x3 x4, link(logit) family(binomial denominator)-, with the intention of telling Stata that, for all y=1 in the cluster defined by denominator, the value of denominator j is the total number of observations within cluster j.

      However, the correct usage seems to be that, in order to use denominator, y should be re-parameterized to #successes_of_y, telling Stata the actual value of the numerator for the total number of observations in the groups designated by denominator.
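      As a sanity check on that construction, here is a hypothetical Python sketch of building the (denominator, numerator) pair per hospital cluster (hospital IDs and outcomes are made up for illustration):

```python
# Toy patient-level records: (hospital_id, cr) with cr coded 0/1.
records = [("H01", 1), ("H01", 0), ("H01", 1),
           ("H02", 0), ("H02", 1), ("H02", 1), ("H02", 1)]

# Per hospital: count of patients (denominator) and count of
# successes (numerator) -- the numerator, not the 0/1 indicator,
# is what the outcome variable should hold in the aggregated data.
per_hospital = {}
for hosp, cr in records:
    n, k = per_hospital.get(hosp, (0, 0))
    per_hospital[hosp] = (n + 1, k + cr)

# per_hospital["H01"] is (3, 2): 2 successes out of 3 patients
```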

      I was assuming the command was doing more than what it actually appears to be doing. I had avoided using -collapse- because I thought I would lose the granularity of my patient-level variables by taking the mean, but I did not try including them as part of a by(*) statement, as I thought I had too many. I will take a look at doing that and see what happens.

      3. Your explanation answered my 4th question because what I meant there was: glm y x1 x2 x3 x4, link(logit) family(binomial), which you explained is the correct syntax for using y=0/1 as the outcome variable.

      4. I think the lack of a recommendation to use mixed effects may have been based on two things: personal preference, and some suspicion that mixed-effects models have not been fully vetted (but I am not sure of this). Still, my reading keeps leading me back to mixed effects. I have the Gelman and Hill (2007) book on multilevel and hierarchical models, and am looking into some other references. Can you suggest other references on the pros and cons of using melogit, specifically, beyond the Stata manual?

      Many thanks for your help (and very prompt response)!


      • #4
        There is a great deal of detail about -melogit- in Rabe-Hesketh S, Skrondal A. Multilevel and Longitudinal Modeling Using Stata, 3rd ed., Volume II. Stata Press, College Station, TX, 2012. (It was written for earlier versions of Stata, so some of the syntax is out of date. For example, they refer to the command -xtmelogit-, which is now -melogit- or -meqrlogit-.) As for a justification of when to use multi-level modeling, see Volume I, where the very first chapter discusses this at length. The main argument is that with hierarchical data like yours, the assumption of independence among observations, which is essential for valid estimation by maximum likelihood, fails when you use one-level estimators like -glm-.


        • #5
          I will definitely check out those references. Thanks again for sharing your insight and expertise!