  • Multivariate analyses with small sample size and count data

    Hello,

    This is my first forum post. As a quick introduction, I'm a master's student in public health working on my thesis project, and I'm very excited to have found the Statalist forum. Thank you to everyone for letting me share my questions!

    My current project examines the association between illegal wildlife trade shipments (as a proxy for human-wildlife contact) and zoonotic disease transmission (transmission from animal to human). I am using Ebola as an ecological case study (country-level), and I am limited to a very small sample size of n=32 countries that have a known or predicted geographic distribution of Ebola virus. My outcome variable is the count of index cases of Ebola hemorrhagic fever; only seven of the 32 countries have reported index cases. My primary exposure variable is illegal wildlife shipments of host mammals (i.e., mammals capable of hosting the disease pathogen) by country of origin. Other covariates include human population density, forest area (% of land area), health expenditure per capita, etc.

    So far, I have run a series of Poisson regression models using a stepwise process to identify significant variables. The most parsimonious and best-fitting model based on AIC/BIC includes the independent variables wildlife trade shipments, human population density (log-transformed), and % forest area. Using these variables, I then fit a negative binomial regression model to account for overdispersion and excess zeroes; it produced no significant associations but much lower AIC/BIC values. I also explored a zero-inflated Poisson model (inflating on population density), which produced the lowest AIC/BIC values and a significant Vuong test z-statistic (p<0.05), which supposedly indicates that the zero-inflated model is preferable to the standard Poisson model. This zero-inflated model produced statistically significant associations for all covariates, although the inflated variable, population density, showed opposite effects in the two processes (certain-zero vs. not-certain-zero countries).
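
    For context, the commands behind these models look roughly like the sketch below (the variable names indexcases, shipments, lnpopdens, and forestpct are placeholders standing in for my actual variables):

    poisson indexcases shipments lnpopdens forestpct
    estat ic    // AIC/BIC for the Poisson model

    nbreg indexcases shipments lnpopdens forestpct
    estat ic    // AIC/BIC for the negative binomial model

    zip indexcases shipments lnpopdens forestpct, inflate(lnpopdens)
    estat ic    // AIC/BIC for the zero-inflated Poisson model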

    I'm aware that it's not recommended to use Poisson, negative binomial, ZINB, and ZIP models with small sample sizes. I'm not aware, however, of any alternative models.

    Are there alternative regression models for count outcomes when the sample size is very small? Or are there ways to validate the models given that my data do not meet certain model assumptions? I'm new to the concepts of cross-validation and bootstrapping, but from what I've read, k-fold cross-validation is often the preferred method of model validation yet requires a large sample size.

    I apologize as I'm quite new to statistics, but any recommendations on how to incorporate more robust methods in multivariate analyses (given a small sample size) would be greatly appreciated!

    Thank you very much,

    Katie
    Last edited by Katie Tseng; 10 Apr 2017, 22:07.

  • #2
    Try playing with transformations of the data. For example, if there is some threshold of the count variable which is important, turn it into a dummy variable (0/1) based on whether the count is above or below the threshold.
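
    For instance, something along these lines, using hypothetical variable names (ebola_cases for the count, with any positive count as the threshold):

    * hypothetical example: collapse the count into a 0/1 indicator
    generate byte any_case = (ebola_cases > 0) if !missing(ebola_cases)
    logit any_case shipments lnpopdens forestpct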

    Or perhaps proper specification is your issue and you need to include more covariates to better explain the variation in the outcome variable.

    Regarding the more general issue of "very small sample size" -- no matter what regression type you use, make sure you bootstrap the standard errors.
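
    In Stata that is just the vce() option or the bootstrap prefix, e.g. (illustrative only, continuing the hypothetical variable names above):

    logit any_case shipments lnpopdens forestpct, vce(bootstrap, reps(1000))
    * or, equivalently, with a reproducible seed
    bootstrap, reps(1000) seed(12345): logit any_case shipments lnpopdens forestpct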

    • #3
      On a slightly different note, with an N of 32, you should not be running all this stuff. You've probably estimated more parameters than you have data points. You're almost certain to be overfitting the data.

      Regression and Poisson regression are pretty robust. Maybe follow Chris's suggestion to make the outcome 0/1 and use logit. Most of the fancier techniques only have asymptotic properties. With small samples, keep it simple.

      • #4
        Katie:
        welcome to the list.
        About the (number of parameters)^2/sample size ratio (k^2/n), see also Joao's reply #5 in: http://www.statalist.org/forums/foru...eroscedstisity
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Thank you Chris, Phil, and Carlo for your responses!

          I will try (1) transforming my outcome to binary and running a logit model, (2) bootstrapping to better estimate SEs and CIs, and (3) keeping my model simple: the fewer parameters, the better (k^2/n).

          Many thanks, again!

          Katie

          • #6
            Hello again,

            I've tried bootstrapping the SEs and CIs for my zero-inflated Poisson model using *vce(bootstrap, reps(100))*. However, my bootstrapped CIs are extremely wide (in the thousands) and seem nonsensical for the incidence rate ratio of my primary independent variable (IRR=1.5). Is this likely a result of having a small sample size and excess zeroes in my primary exposure and outcome? I'm also confused about how to interpret the bootstrapped CIs, given that the 32 countries in my analysis are not necessarily a sample from a larger population of countries; rather, they represent all countries at risk for Ebola/Marburg virus disease. Would I simply interpret the bootstrap intervals as more accurately reflecting the true parameters of my model?
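
            For reference, the command was along these lines (the variable names are again placeholders for my actual ones):

            zip indexcases shipments lnpopdens forestpct, inflate(lnpopdens) irr vce(bootstrap, reps(100))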

            This is my first time running a bootstrap, so apologies for the confusion. Any feedback would be appreciated! Thank you!

            Katie

            • #7
              Katie:
              it is difficult to say whether your intuitions are correct without seeing what you typed and what Stata gave you back (as per the FAQ). Thanks.
              Kind regards,
              Carlo
              (Stata 19.0)
